-
cuikun_lin
- Newbie
- Posts: 4
- Joined: Wed Mar 03, 2021 10:18 pm
#1
by cuikun_lin » Wed Mar 03, 2021 11:55 pm
We have a cluster of 240 nodes connected via a 200 Gbps IB network. Each node is a 128-core AMD EPYC 7702 chip. For many applications (like LAMMPS), getting good performance requires binding the MPI tasks to the L3 cache (32 MPI tasks) and then having each MPI task create 4 threads. Here is how the binding is done for LAMMPS (two different ways).
METHOD 1
Code:
mpirun -np 256 --bind-to core --map-by hwthread -use-hwthread-cpus -mca btl vader,self lmp -var r 1000 -in in.rhodo -sf omp
METHOD 2
Code:
export OMP_NUM_THREADS=4
mpirun --mca btl self,vader --map-by l3cache lmp -var r 1000 -in in.rhodo -sf omp
Can something similar be done for VASP? We built VASP with OpenMP support, but we cannot get the binding and threading to work.
Thanks
-
merzuk.kaltak
- Administrator
- Posts: 295
- Joined: Mon Sep 24, 2018 9:39 am
#2
by merzuk.kaltak » Thu Mar 04, 2021 1:01 pm
Currently we have AMD EPYC chips on nodes connected only via 1 Gbps.
As such, we can't test multi-node processor pinning and thread launching in practice yet.
Concerning MPI+OpenMP on a single node:
I have tried only MPI parallelization with an EPYC 7402P, where the mpirun option "--map-by core" described in this AMD tuning guide was sufficient.
MPI + OpenMP threading is explained in general on our wiki page. The idea is that each MPI rank and the threads it launches should stay on the same node or (even better) on the same socket.
In the case of EPYC chips, "same socket" would even mean the same chiplet.
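A minimal sketch of the layout arithmetic behind this idea, assuming 128 cores per node and 4 OpenMP threads per MPI rank (one rank per L3 cache, as described in the first post). The `--map-by l3cache:PE=...` and `--bind-to core` options are the Open MPI options that appear in the commands posted in this thread; `OMP_PLACES` and `OMP_PROC_BIND` are the standard OpenMP environment variables. Whether this layout is the best one for VASP on a given cluster is an assumption to check with timings. The script only prints the launch line and does not run it.

```shell
#!/bin/sh
# Sketch: derive a hybrid MPI+OpenMP layout for one node and print the launch line.
# Assumed values (taken from the thread; adjust for your hardware):
cores_per_node=128      # physical cores per node
threads_per_rank=4      # OpenMP threads per MPI rank (one rank per L3 cache)

# One MPI rank per group of threads_per_rank cores.
ranks_per_node=$((cores_per_node / threads_per_rank))

# Print the command instead of executing it, so the sketch is safe to try anywhere.
echo "mpirun -np ${ranks_per_node} --map-by l3cache:PE=${threads_per_rank} --bind-to core --report-bindings -x OMP_NUM_THREADS=${threads_per_rank} -x OMP_PLACES=cores -x OMP_PROC_BIND=close vasp_std"
```

With `--report-bindings`, Open MPI prints where each rank was placed, which is a quick way to confirm that every rank really received 4 cores.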
-
cuikun_lin
- Newbie
- Posts: 4
- Joined: Wed Mar 03, 2021 10:18 pm
#3
by cuikun_lin » Sun Mar 07, 2021 2:56 pm
I ran VASP with four different mpirun setups. Here are the associated timings.
Code:
time -p mpirun vasp_std
286.583 seconds
time -p mpirun --bind-to core vasp_std
287.094 seconds
(1) time -p mpirun --map-by core --report-bindings --mca pml ucx --mca osc ucx \
--mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_2:1 -x \
HCOLL_MAIN_IB=mlx5_2:1 vasp_std
10222.636 seconds
(2) mpirun -np 32 --map-by l3cache:PE=4 --bind-to core \
-x OMP_NUM_THREADS=4 -x OMP_STACKSIZE=512m \
-x KMP_AFFINITY=verbose,granularity=fine,compact,1,0 \
vasp_std
415.790 seconds
I'm not sure why the mpirun command (1) suggested in the AMD tuning guide for 7002 processors performs so badly. When I run the threaded version (2), what should I see in top? I expected to see 32 MPI tasks, each using approximately 400% CPU, but didn't.
Thanks
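On the top question: by default, top shows one line per process with the CPU usage of all its threads summed, so a rank with 4 busy threads would show roughly 400%; `top -H` lists the threads individually. A minimal Linux-only sketch for checking how many threads a rank has and which CPUs it is allowed to run on follows. `thread_report` is a hypothetical helper name, not part of VASP or Open MPI, and the script demonstrates it on the current shell rather than on a real VASP run.

```shell
#!/bin/sh
# Sketch for Linux: report how many threads a process has and which CPUs it may use.
# thread_report is a hypothetical helper name chosen for this example.
thread_report() {
    pid="$1"
    # Each entry under /proc/<pid>/task is one thread of that process.
    nthreads=$(ls "/proc/${pid}/task" | wc -l | tr -d ' ')
    # Cpus_allowed_list is the set of CPUs the process is bound to.
    cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' "/proc/${pid}/status")
    echo "pid=${pid} threads=${nthreads} cpus=${cpus}"
}

# Demonstrate on the current shell. For a real run, loop over the rank PIDs, e.g.:
#   for p in $(pgrep vasp_std); do thread_report "$p"; done
thread_report "$$"
```

If each rank reports only one thread, or an allowed CPU list with a single core, the OpenMP threads are probably not being created or are stacked on one core. Note also that `KMP_AFFINITY` is read by the Intel and LLVM OpenMP runtimes; a GNU-built binary ignores it, and `OMP_PLACES` and `OMP_PROC_BIND` are the portable alternative.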
-
thda0531
- Newbie
- Posts: 5
- Joined: Tue Apr 22, 2008 7:00 am
- License Nr.: 18
#4
by thda0531 » Tue Mar 16, 2021 9:42 am
Hi cuikun_lin,
Sorry to be completely off topic, but could you share the makefile.include settings you used to build VASP efficiently on AMD EPYC?
Thank you in advance.
-
cuikun_lin
- Newbie
- Posts: 4
- Joined: Wed Mar 03, 2021 10:18 pm
#5
by cuikun_lin » Sat Mar 20, 2021 12:39 am
Thda0531,
All the heavy lifting was done by our HPCC staff, who did a lot of work on optimizing the clusters. I believe they are still optimizing to use CPU time more efficiently.
For the makefile, I tried different ones from the VASP wiki, and they work very well if you follow the instructions there.
For AMD compiler options such as O1, O2, O3, or Ofast, please see this document:
https://www.amd.com/system/files/docume ... essors.pdf
In limited tests of the GNU version, O3 and Ofast are pretty good. NERSC has also done extensive benchmark tests. Please see the following document:
https://www.nersc.gov/assets/Uploads/Co ... 90212.pptx
Hope this will help.