VASP NCCL + OpenACC + OpenMP

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


ghasemi
Newbie
Posts: 5
Joined: Thu Feb 02, 2023 11:27 am

VASP NCCL + OpenACC + OpenMP

#1 Post by ghasemi » Tue Aug 08, 2023 7:17 am

Dear VASP developers,

When I run VASP with NCCL on one GPU using one OpenMP thread, it works fine and completes a single-point calculation in about one minute. For the same system with multiple OpenMP threads (still on one GPU), vasp_std gets stuck almost at the beginning, and the last text in OUTCAR is

First call to EWALD: gamma= 0.156
Maximum number of real-space cells 3x 3x 3
Maximum number of reciprocal cells 3x 3x 3

FEWALD: cpu time 0.0617: real time 0.0043


I used a Makefile almost identical to Makefile.include.nvhpc_ompi_mkl_omp_acc, as proposed on the wiki page:
wiki/index.php/OpenACC_GPU_port_of_VASP

I let it run much longer, but there was no progress and no further output in OUTCAR.

VASP version: 6.3.2
Compiler: NVHPC 21.11

I would appreciate any help!

Best regards
Ghasemi

alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP NCCL + OpenACC + OpenMP

#2 Post by alexey.tal » Tue Aug 08, 2023 11:58 am

Dear ghasemi,

How many OMP threads did you use in this calculation?
Can you run this calculation with multiple OMP threads but without GPUs?

ghasemi
Newbie
Posts: 5
Joined: Thu Feb 02, 2023 11:27 am

Re: VASP NCCL + OpenACC + OpenMP

#3 Post by ghasemi » Tue Aug 08, 2023 12:21 pm

Dear Alexey,

I have tried different numbers of OMP threads, e.g. 2, 4, and 16, where the last is the maximum I can use for a single-GPU allocation. Of course, I used one MPI process, as is required when running with NCCL.
Yes. The same version of VASP, with the same input files (INCAR, POSCAR, etc.), has been tested in hybrid MPI+OpenMP mode with various numbers of MPI processes and OpenMP threads. However, that binary was built with the Intel compilers.
I will test it with NVHPC without GPU and post again.

alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP NCCL + OpenACC + OpenMP

#4 Post by alexey.tal » Tue Aug 08, 2023 12:31 pm

Thank you. It would also be a good idea to test it with a more recent version of NVHPC. NVHPC 21.11 was released in 2021.

ghasemi
Newbie
Posts: 5
Joined: Thu Feb 02, 2023 11:27 am

Re: VASP NCCL + OpenACC + OpenMP

#5 Post by ghasemi » Wed Aug 09, 2023 7:56 am

After rebuilding VASP with NVHPC 21.11 without OpenACC, the same system runs fine (although slowly, as expected) without a GPU, with both 1 and 16 threads.
In another test, I used NVHPC 22.5 with OpenACC, and the same system runs fine on the GPU with both 1 and 16 OpenMP threads. However, in this binary I did not link to HDF5, so the problem may be related to the compiler or to the HDF5 link.

How much work is left for the CPU when running on a GPU?
How much performance gain can one expect when using more than one OpenMP thread?
Is it right that the majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs, or compare NCCL with no NCCL?

alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP NCCL + OpenACC + OpenMP

#6 Post by alexey.tal » Wed Aug 09, 2023 9:55 am

ghasemi wrote: Wed Aug 09, 2023 7:56 am
However, in this binary I did not link to HDF5, so the problem may be related to the compiler or to the HDF5 link.

Looks like a compiler issue. I doubt that HDF5 is the problem here, but this can easily be tested.

ghasemi wrote: Wed Aug 09, 2023 7:56 am
How much work is left for the CPU when running on a GPU?
How much performance gain can one expect when using more than one OpenMP thread?
Is it right that the majority of GPU-enabled VASP benchmarks focus on speedup as a function of the number of GPUs, or compare NCCL with no NCCL?

We usually do our performance tests with NCCL, which can yield a performance gain of 20-30% for our standard electronic minimization calculation thanks to the asynchronous communication. However, NCCL can only handle one MPI rank per GPU, which means that on a multicore CPU only one core is used. To improve the situation, one can use multiple OpenMP threads to increase the utilization of the CPU cores. But one should keep in mind that all the heavy parts of the calculation are ported to GPUs, so the performance gain from using OpenMP threads is usually not very large.
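
As a rough illustration of the two points above (one MPI rank per GPU with NCCL, and a limited gain from OpenMP threads), here is a minimal Python sketch. It is not taken from VASP or its documentation: the function names are hypothetical, the thread count is simply the core count divided evenly among the ranks, and the gain estimate is a plain Amdahl's-law bound that assumes only the CPU-side share of the runtime scales with threads. The 10% CPU-side share used in the example is an assumed number, not a measurement.

```python
def plan_launch(n_gpus: int, n_cores: int) -> dict:
    """Suggest a rank/thread layout under the one-MPI-rank-per-GPU constraint.

    Hypothetical helper: each rank receives an equal share of the CPU cores
    as OpenMP threads (the value one would export as OMP_NUM_THREADS).
    """
    if n_gpus < 1 or n_cores < n_gpus:
        raise ValueError("need at least one GPU and one core per GPU")
    return {"mpi_ranks": n_gpus, "omp_threads_per_rank": n_cores // n_gpus}


def amdahl_speedup(cpu_fraction: float, n_threads: int) -> float:
    """Amdahl's-law upper bound on the overall speedup.

    cpu_fraction is the (assumed) share of runtime spent in CPU-side code
    that parallelizes perfectly over threads; the GPU part is unaffected.
    """
    if not 0.0 <= cpu_fraction <= 1.0 or n_threads < 1:
        raise ValueError("cpu_fraction must be in [0, 1] and n_threads >= 1")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / n_threads)


if __name__ == "__main__":
    # Single-GPU allocation with 16 cores, as in the case discussed here.
    plan = plan_launch(n_gpus=1, n_cores=16)
    print(plan)
    # Assumed 10% CPU-side share of the runtime (illustrative only).
    print(round(amdahl_speedup(0.10, plan["omp_threads_per_rank"]), 2))
```

With the assumed 10% CPU-side share, 16 threads give a bound of roughly 1.10x, which is in line with the remark that the gain from OpenMP threads is usually modest. The real share depends on the system and the algorithm and would have to be measured.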

ghasemi
Newbie
Posts: 5
Joined: Thu Feb 02, 2023 11:27 am

Re: VASP NCCL + OpenACC + OpenMP

#7 Post by ghasemi » Wed Aug 09, 2023 10:49 am

Thanks for the reply.

guorong_weng
Newbie
Posts: 7
Joined: Thu Jul 27, 2023 3:59 pm

Re: VASP NCCL + OpenACC + OpenMP

#8 Post by guorong_weng » Thu Aug 10, 2023 8:42 pm

Your reply here is super useful and instructive. Thanks a lot, Alexey.
