VASP NCCL + OpenACC + OpenMP
-
- Newbie
- Posts: 5
- Joined: Thu Feb 02, 2023 11:27 am
VASP NCCL + OpenACC + OpenMP
Dear VASP developers,
When running VASP with NCCL on one GPU using one OpenMP thread, it works fine and completes a single-point calculation within about one minute. For the same system with multiple OpenMP threads (still on one GPU), vasp_std gets stuck almost at the beginning, and the last text in OUTCAR is:
First call to EWALD: gamma= 0.156
Maximum number of real-space cells 3x 3x 3
Maximum number of reciprocal cells 3x 3x 3
FEWALD: cpu time 0.0617: real time 0.0043
I used a Makefile almost identical to Makefile.include.nvhpc_ompi_mkl_omp_acc, as proposed on the wiki page:
wiki/index.php/OpenACC_GPU_port_of_VASP
I let it run much longer, but there was no progress and no further output in OUTCAR.
VASP version: 6.3.2
Compiler: NVHPC 21.11
I would appreciate any help!
Best regards
Ghasemi
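A hang like the one described (no further output in OUTCAR) can be checked for automatically. The following is only a minimal sketch, not part of VASP: the function name `outcar_stall_report`, the 600-second threshold, and the number of tail lines are all hypothetical choices. It reports whether the file has been idle for a given time and returns its last non-empty lines.

```python
import os
import time


def outcar_stall_report(path="OUTCAR", stall_seconds=600, tail_lines=4, now=None):
    """Return (stalled, tail) for an output file.

    stalled is True when the file has not been modified for at least
    stall_seconds (a hypothetical threshold). tail holds the last
    non-empty lines, which show where the run stopped writing.
    """
    now = time.time() if now is None else now
    idle = now - os.path.getmtime(path)
    with open(path, "r", errors="replace") as handle:
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    return idle >= stall_seconds, lines[-tail_lines:]


if __name__ == "__main__":
    # Only report when an OUTCAR exists in the current directory.
    if os.path.exists("OUTCAR"):
        stalled, tail = outcar_stall_report()
        print("stalled:", stalled)
        print("\n".join(tail))
```

A long idle time alone does not prove a deadlock, since some steps legitimately write nothing for a while, so the threshold should be chosen relative to the known runtime (about one minute for the single-thread run here).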
-
- Global Moderator
- Posts: 319
- Joined: Mon Sep 13, 2021 12:45 pm
Re: VASP NCCL + OpenACC + OpenMP
Dear ghasemi,
How many OMP threads did you use in this calculation?
Can you run this calculation with multiple OMP threads but without GPUs?
-
- Newbie
- Posts: 5
- Joined: Thu Feb 02, 2023 11:27 am
Re: VASP NCCL + OpenACC + OpenMP
Dear Alexey,
I have tried different numbers of OMP threads, e.g. 2, 4, and 16, where the last is the maximum I can use for a single-GPU allocation. Of course, I used one MPI process, as is required when running with NCCL.
Yes. The same version of VASP with the same input files (INCAR, POSCAR, etc.) has been tested in hybrid mode, MPI+OpenMP, with various numbers of MPI processes and OpenMP threads. However, that binary was built with the Intel compiler.
I will test it with NVHPC without GPU and post again.
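The run configuration described here (one MPI process, several OpenMP threads) could be set up as in the sketch below. This is an illustration under assumptions: the helper name `build_launch` is hypothetical, `mpirun -np 1` assumes an Open MPI style launcher, and any thread binding or GPU selection options required on a given cluster are deliberately left out.

```python
import os


def build_launch(omp_threads, vasp_binary="vasp_std", launcher="mpirun"):
    """Return (command, environment) for one MPI rank with omp_threads threads."""
    if omp_threads < 1:
        raise ValueError("omp_threads must be at least 1")
    env = dict(os.environ)
    # OMP_NUM_THREADS is the standard OpenMP variable for the thread count.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    # A single MPI rank, matching the one-GPU NCCL setup from the thread.
    command = [launcher, "-np", "1", vasp_binary]
    return command, env


if __name__ == "__main__":
    command, env = build_launch(omp_threads=4)
    print(" ".join(command), "with OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    # To actually start the job, pass both to subprocess.run(command, env=env).
```

Running the same command with 1, 2, 4, and 16 threads makes it easy to see at which thread count the hang first appears.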
-
- Global Moderator
- Posts: 319
- Joined: Mon Sep 13, 2021 12:45 pm
Re: VASP NCCL + OpenACC + OpenMP
Thank you. It would also be a good idea to test it with a more recent version of NVHPC. NVHPC 21.11 was released in 2021.
-
- Newbie
- Posts: 5
- Joined: Thu Feb 02, 2023 11:27 am
Re: VASP NCCL + OpenACC + OpenMP
After rebuilding VASP with NVHPC 21.11 without OpenACC, the same system runs fine (though slowly, as expected) without a GPU, with both 1 and 16 threads.
In another test, I used NVHPC 22.5 with OpenACC, and the same system runs fine on the GPU with 1 and 16 OpenMP threads. However, I did not link this binary to HDF5, so the problem may be related either to the compiler or to linking against HDF5.
How much work is left for the CPU when running on the GPU?
How much performance gain can one expect when using more than one OpenMP thread?
The majority of GPU-enabled VASP benchmarks seem to focus on speedup as a function of the number of GPUs, or compare runs with and without NCCL.
-
- Global Moderator
- Posts: 319
- Joined: Mon Sep 13, 2021 12:45 pm
Re: VASP NCCL + OpenACC + OpenMP
ghasemi wrote:
However, in this binary, I did not link to HDF5, therefore, it may be a problem related to the compiler or due to the link to HDF5.
Looks like a compiler issue. I doubt that HDF5 is a problem here, but it can be easily tested.
ghasemi wrote:
How much work is left for CPU when running on GPU?
How much gain in performance one can expect when using more than one OpenMP threads?
The majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs or compare NCCL with no NCCL?
We usually do our performance tests with NCCL, which can yield a performance gain of 20-30% for our standard electronic minimization calculation thanks to the asynchronous communication. However, NCCL can only handle one MPI rank per GPU, which means that on a multicore CPU only one core is being used. To improve the situation one can use multiple OpenMP threads to increase the utilization of the CPU cores. But one should keep in mind that all the heavy parts of the calculation are ported to GPUs, so the performance gain from using OpenMP threads is usually not very large.
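The point about limited OpenMP gains can be illustrated with an Amdahl-style estimate. This is a rough sketch, not a VASP measurement: the function name `openmp_speedup` is hypothetical, and the 10% CPU share used in the example is an assumed number chosen only for illustration. If a fraction f of the single-thread wall time is CPU work that scales with threads and the rest (GPU kernels, serial code) is unaffected, the overall speedup with n threads is 1 / ((1 - f) + f / n).

```python
def openmp_speedup(cpu_fraction, threads):
    """Amdahl-style estimate of the overall speedup from threading the CPU part.

    cpu_fraction: share of single-thread wall time spent in CPU code that
                  scales with the thread count (between 0 and 1); everything
                  else is assumed to be unaffected by OpenMP.
    threads:      number of OpenMP threads (at least 1).
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / threads)


if __name__ == "__main__":
    # Assumed, purely illustrative CPU share of 10%.
    for threads in (1, 2, 4, 16):
        print(threads, round(openmp_speedup(0.10, threads), 3))
```

Under this assumption the speedup can never exceed 1 / (1 - f), which is about 1.11 for f = 0.10, no matter how many threads are used. The actual CPU share depends on the system and settings and would have to be measured.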
-
- Newbie
- Posts: 5
- Joined: Thu Feb 02, 2023 11:27 am
Re: VASP NCCL + OpenACC + OpenMP
Thanks for the reply.
-
- Newbie
- Posts: 7
- Joined: Thu Jul 27, 2023 3:59 pm
Re: VASP NCCL + OpenACC + OpenMP
alexey.tal wrote: ↑Wed Aug 09, 2023 9:55 am
Looks like a compiler issue. [...]
Your reply here is super useful and instructive. Thanks a lot, Alexey.