Compiling with AOCC/AOCL OpenMPI SGRGEN error
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Compiling with AOCC/AOCL OpenMPI SGRGEN error
Dear knowledgeable vasp users,
My system admin and I are trying some new things with VASP. He got a new node to try out with large-cache AMD server chips (2x AMD EPYC 9684X 96-core per node, each CPU having 1152MB of L3 cache). We wanted to test how VASP simulations scale on this machine and compare it to our local cluster (2x AMD EPYC 9654 96-core, each CPU having 384MB of L3 cache). In the end, this is just a comparison of how VASP uses the cache and how that affects its efficiency.
Besides compiling it the traditional FOSS way (GCC + OpenMPI + OpenBLAS + Netlib-ScaLAPACK + FFTW, which worked fine and performed better on the larger-cache chips), we also wanted to see how the AOCC compiler and the AOCL math libraries, together with OpenMPI, would change the performance. We assume that these should be better at taking advantage of the large cache. However, although the compilation reports that it finishes successfully, we get crashes when we try to run our example simulation (```VERY BAD NEWS! internal error in subroutine SGRGEN: Too many elements 49 ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- ```). Looking around, we see that there have been some forum discussions on this topic before (see: https://wwww.vasp.at/forum/viewtopic.ph ... GEN#p17814, or https://wwww.vasp.at/forum/viewtopic.ph ... GEN#p17686). However, the solution links to a webpage that is no longer available (http://cms.mpi.univie.ac.at/vasp-forum/ ... GEN#p17686). As those threads indicate that the issue might be caused by the MPI implementation, we are currently rechecking the compilation and the environment variables at runtime. I will send the exact compiler versions and settings, the makefile.include, the compile output, and the output of the crashed simulation later today. I assume that these are necessary to solve the issue.
On another note, we are using a small AIMD example to assess the performance. However, it has 374 atoms and uses only the gamma point. Does someone have another real system we could use to check whether the cache matters (about two hours of runtime on 4 cores, and of course faster when we try with 192 cores)? Preferably something that is not AIMD (plain DFT or a structure optimization), with more k-points, and with an INCAR that is already somewhat optimized for a larger number of cores. I will share the results afterwards.
Regards,
Jelle Lagerweij
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Before you run any larger calculations, please try the test suite that is provided with VASP:
https://www.vasp.at/wiki/index.php/Validation_tests
After that, please send the important files (stdout, OUTCAR, INCAR, POSCAR, POTCAR, KPOINTS) from any job that failed, preferably from the smallest job.
Please also send the makefile.include you used for compilation.
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Dear Ferenc,
Thanks for your reply, and sorry that it took a while to answer. Some of the files I had prepared were wrong and I needed to fix them. The runtime issue remains the same, though. My colleague will try to use the standard
Code: Select all
make test
after the installation to check. We were using the following setup: AOCC 4.1 and AOCL 4.1 (binary blis, libflame, etc., which you can get from the AMD website https://www.amd.com/en/developer/aocl/e ... 1.0.tar.gz). Additionally, we used OpenMPI 5.0.2, which we recompiled ourselves with AOCC as well (just in case).
I have added the makefile.include and the compiler output in the compiling subfolder, and the stdout, OUTCAR, INCAR, POSCAR, POTCAR, KPOINTS in the runcase subfolder. I also reran these simulations with my working installs (I have two versions: 1) gcc11 + openmpi + mkl + hdf5 and 2) gcc11 + openmpi + openblas + netlib-scalapack + fftw + hdf5). In both, they worked flawlessly. Therefore, I assume that nothing is wrong with the input files themselves. The example case is a 25-time-step AIMD simulation that should take approximately 15 minutes to run. It was this short from the start because we were testing parallel efficiency.
Additionally, before running the AOCC-compiled version, we made sure that the correct environment was set up (as the machine has multiple environments available). Therefore, we used the following code to start the simulation:
Code: Select all
. /opt/AMD/setenv_AOCC.sh
export PATH=/opt/openmpi-5.0.2-aocc/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-5.0.2-aocc/lib:/opt/amd-fftw/lib:/opt/amd-scalapack/lib/LP64:/opt/amd-blis/lib/LP64:/opt/amd-libflame/lib/LP64:$LD_LIBRARY_PATH
OMP_NUM_THREADS=1 time mpirun -np 32 /home/grepit/TestCase_Gerben/vasp.6.4.2-AOCC/bin/vasp_gam
I hope that you can provide us with some helpful insights,
Kind regards,
Jelle Lagerweij
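One way to double-check the runtime environment described in the post above is to look at which shared libraries the binary actually resolves to once the setup script has been sourced. The following is only a rough sketch under stated assumptions: `check_libs` is a hypothetical helper (not part of VASP, AOCC or AOCL), and the name fragments passed to it (`blis`, `flame`, `scalapack`, `fftw`, `mpi`) are guesses at how the shared objects are named on a given machine. Libraries that were linked statically will not show up in `ldd` output at all, so a "MISSING" line is a hint rather than proof of a problem.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd` output on stdin and, for each name
# fragment given as an argument, reports the path the first matching
# library resolves to, or flags it as unresolved / not listed.
check_libs() {
    ldd_out=$(cat)
    status=0
    for name in "$@"; do
        # First ldd line that mentions this fragment (case-insensitive).
        line=$(printf '%s\n' "$ldd_out" | grep -i "$name" | head -n 1)
        case "$line" in
            "")            echo "MISSING   $name"; status=1 ;;
            *"not found"*) echo "UNRESOLVED $name"; status=1 ;;
            *)             echo "OK        $name ->${line#*=>}" ;;
        esac
    done
    return $status
}

# Example usage, after sourcing the environment shown in the post:
# ldd /home/grepit/TestCase_Gerben/vasp.6.4.2-AOCC/bin/vasp_gam \
#     | check_libs blis flame scalapack fftw mpi
```

If a library resolves to a path outside the intended AOCC/AOCL install (for example a system-wide OpenMPI), that would point to an environment mix-up rather than a compiler problem.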
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
A short question before I look at everything in detail:
Could you run any calculation at all (e.g. the ones from the test suite), or do you get the error message every time?
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Dear Ferenc,
I believe that we got this error every time. The machine was with someone else (our system administrator), but he mentioned that he got this issue in all test cases when using the aocc or the intel compilers (although both compiled successfully). He also used the standard gcc+openmpi+openblas+netlib-scalapack+fftw installation method. In that case, everything worked fine.
I am currently trying the manual provided by AMD themselves (https://www.amd.com/en/developer/zen-so ... /vasp.html). The only drawback is that VASP is licensed software, so spack uses a checksum on the compressed folder to verify that you have a correct version. This is totally fine with me, except that spack does not include version 6.4.2 at this point, and that is the version of my compressed VASP files (I rely on VASP 6.4+ features in some larger simulations). I am currently adjusting the spack installation method myself (after spack install, I used <spack edit vasp> and added version 6.4.2 with the checksum I computed from my official VASP 6.4.2 archive). I want to see how this works as well.
Kind regards,
Jelle Lagerweij
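The checksum step mentioned in the post above could be sketched roughly as follows. This is an illustration under assumptions, not the procedure from the AMD manual: it assumes that the Spack package file declares versions with a `version("...", sha256="...")` directive, which should be checked against what `spack edit vasp` actually shows, and both the helper name `spack_version_line` and the archive name in the example are hypothetical.

```shell
#!/bin/sh
# Print a Spack-style version directive for a locally downloaded archive.
# Assumption: the package file expects the sha256 checksum of the tarball.
spack_version_line() {
    version=$1
    tarball=$2
    # sha256sum prints "<hash>  <file>"; keep only the hash.
    sum=$(sha256sum "$tarball" | cut -d ' ' -f 1)
    printf 'version("%s", sha256="%s")\n' "$version" "$sum"
}

# Example (the archive name is hypothetical):
# spack_version_line 6.4.2 vasp.6.4.2.tgz
```

The printed line would then be pasted next to the existing version entries in the package file.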
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
It's really hard to debug from here. The code fails in the symmetry routines, but that is most likely an aftereffect of a faulty compilation.
I hope that what you described helps.
If possible, please first try these compilers/toolchains:
3.2.0_aocl-3.1_ompi-4.1.2, amdscalapack/3.1, amdblis/3.1
This is what we use and it is very stable.
-
- Newbie
- Posts: 9
- Joined: Mon Dec 10, 2012 7:15 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
We ran into the same problem with AOCC/AOCL.
On Mar 01, we compiled VASP 6.4.2 with OpenMPI and OpenMP on an AMD 2x EPYC 7713 cluster (128 cores per node) after loading the modules aocc/4.1.0 openmpi/4.1.6 amdblis/4.1 amdlibflame/4.1 amdscalapack/4.1 amdfftw/4.1. The compilation seemed to be successful, but calculations always stop with the following error:
| VERY BAD NEWS! internal error in subroutine SGRGEN: Too many |
| elements 49 |
-
- Newbie
- Posts: 25
- Joined: Wed Jul 20, 2022 7:18 am
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Hi, have you solved this problem? I ran into the same problem, but only for one specific model. The error vanished when calculating other models.
jelle_lagerweij wrote: ↑Wed Feb 21, 2024 9:48 am
Dear Ferenc,
I believe that we got this error every time. The machine was with someone else (our system administrator), but he mentioned that he got this issue in all test cases when using the aocc or the intel compilers (although both compiled successfully). He also used the standard gcc+openmpi+openblas+netlib-scalapack+fftw installation method. In that case, everything worked fine.
I am currently trying the manual provided by AMD themselves (https://www.amd.com/en/developer/zen-so ... /vasp.html). The only drawback is that vasp is licensed software and that spack uses a checksum on the compressed folder to see if you have a correct version. This is totally fine to me, except that no version 6.4.2 is implemented in spack at this point, but my compressed vasp files are (and I rely on vasp 6.4+ features in some larger simulations). I am currently adjusting the spack installation method myself (after spack install, I use <spack edit vasp> and added the version 6.4.2 with the checksum I retrieved from my official vasp 6.4.2 version). I want to see how this works as well.
Kind regards,
Jelle Lagerweij
-
- Newbie
- Posts: 16
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Hi All,
A small update: I have not been able to solve this issue, and neither has my system administrator. The testing machine (with the extra-large cache) is no longer available to us, and I went back to using the openmpi/openblas/netlib-scalapack/fftw3 installation with gcc11. I am still not sure what exactly the issue is. I created the toolchain mentioned by Ferenc with spack, but still had some issues while compiling, and my old installation was working fine. We were just unsure whether we were getting the most out of our compute time, and were interested in how much impact the change in compiler and math libraries would have.
Kind regards,
Jelle
-
- Newbie
- Posts: 9
- Joined: Mon Dec 10, 2012 7:15 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Tested again with OFLAG set to -O1: vasp-6.4.2 can be compiled with aocc-4.2.0/aocl-4.2.0 along with openmpi-5.0.2. Preliminary tests indicate that it is even faster than the gcc-12.2.0 build compiled with openmpi-4.0.4, fftw-3.3.10, and openblas-0.3.23 using OFLAG set to -O2. Next, we will try to compile with aocc/aocl using OFLAG=-O2 and with only the symmetry routines set to -O1.
-
- Newbie
- Posts: 9
- Joined: Mon Dec 10, 2012 7:15 pm
Re: Compiling with AOCC/AOCL OpenMPI SGRGEN error
Simply adding symlib.o on the line of OBJECTS_O1 solved the problem for us.
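For readers who want to apply the same change, a minimal sketch is given below. It assumes that makefile.include already contains a single line starting with `OBJECTS_O1` with no trailing line continuation (as far as I know, the arch templates shipped with VASP 6 have one listing the FFT objects), and it simply appends `symlib.o` to that line. The function name `add_symlib_o1` is hypothetical, and the exact contents of the line may differ between templates.

```shell
#!/bin/sh
# Append symlib.o to the OBJECTS_O1 line of a makefile.include so that the
# symmetry library is built with the lower optimization level.
# Does nothing if symlib.o is already listed on that line.
add_symlib_o1() {
    file=$1
    if grep -q '^OBJECTS_O1.*symlib\.o' "$file"; then
        return 0
    fi
    # Keep a backup next to the original, then rewrite the file from it.
    cp "$file" "$file.bak"
    sed 's/^\(OBJECTS_O1.*\)$/\1 symlib.o/' "$file.bak" > "$file"
}

# Example:
# add_symlib_o1 makefile.include
```

After the edit, VASP has to be rebuilt from a clean state so that symlib is actually recompiled with the new flags.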