Large BSE matrix diagonalization

Queries about input and output files, running specific calculations, etc.



xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Large BSE matrix diagonalization

#1 Post by xiaoming_wang » Mon Dec 09, 2024 1:16 am

I'm conducting convergence studies of the k mesh for my BSE calculations. The dimension of the BSE matrix is set by NBANDSO=18 and NBANDSV=24. I have successfully run BSE calculations on 6x6x6, 7x7x7, 8x8x8, and 9x9x9 k meshes, although the matrix ranks are large. However, when I moved to 10x10x10, I got the following error message:

Code: Select all

 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     LWWORK 1826745088 1 2304000 384000                                      |
|     ERROR in subspace rotation PSSYEVX/ PCHEEVX: not enough eigenvalues     |
|     found 0 384000                                                          |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

which seems to be related to the BSE matrix diagonalization using the ScaLAPACK routines. Is there any way to get rid of it?
By the way, I'm confident that there are no memory problems here.

Best,
Xiaoming Wang


merzuk.kaltak
Administrator
Posts: 295
Joined: Mon Sep 24, 2018 9:39 am

Re: Large BSE matrix diagonalization

#2 Post by merzuk.kaltak » Mon Dec 09, 2024 11:27 am

Dear Xiaoming Wang,

would you please upload the INCAR, POSCAR, POTCAR, and KPOINTS files as well as the OUTCAR and stdout?
If possible, please also upload the job script.
This would help us find the actual cause of the problem.


xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#3 Post by xiaoming_wang » Mon Dec 09, 2024 1:21 pm

Hi,

Please find the attached files.

Best,
Xiaoming Wang


merzuk.kaltak
Administrator
Posts: 295
Joined: Mon Sep 24, 2018 9:39 am

Re: Large BSE matrix diagonalization

#4 Post by merzuk.kaltak » Mon Dec 09, 2024 2:34 pm

Dear Xiaoming Wang,

it seems your job script has some inconsistent settings.
The header of VASP's output reads:

Code: Select all

 running   96 mpi-ranks, with   32 threads/rank, on   48 nodes
 distrk:  each k-point on   96 cores,    1 groups
 distr:  one band on    1 cores,   96 groups
 

This indicates that you run 32 OpenMP threads per MPI rank, which seems quite excessive to me.
You probably have to set OMP_NUM_THREADS=1 or 2, depending on the hardware you are running your job on.
You probably also have to edit your job script accordingly, because the following settings seem iffy to me as well:

Code: Select all

#SBATCH --nodes=48 # <--- seems to be a large number of compute nodes 
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=32

We have a very good summary on our wiki describing how to combine MPI and OpenMP.
In case you insist on using srun to submit jobs on the cluster, I suggest studying the official SLURM documentation on multi-core/multi-thread architectures.
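
Purely for illustration, here is a minimal sketch of what a more conventional layout could look like; the node count, cores per node, and pinning settings below are assumptions and have to be adapted to the actual hardware (see the wiki page and SLURM documentation mentioned above):

Code: Select all

#!/bin/bash
#SBATCH --nodes=4                # assumption: illustrative node count, not a recommendation
#SBATCH --tasks-per-node=64      # assumption: a 128-core node filled with 64 MPI ranks
#SBATCH --cpus-per-task=2        # leaves room for at most 2 OpenMP threads per rank

# one or two OpenMP threads per MPI rank instead of 32
export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# srun inherits the task layout from the SBATCH directives above
srun vasp_std > log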


xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#5 Post by xiaoming_wang » Mon Dec 09, 2024 3:30 pm

Thanks. I'll try reducing the number of threads as well as the number of nodes.

I also tried running on GPU nodes with the following script:

Code: Select all

#!/bin/bash
#SBATCH -N 24
#SBATCH -C gpu
#SBATCH -G 96
#SBATCH -q regular
#SBATCH -t 24:00:00

#OpenMP settings:
export OMP_NUM_THREADS=16
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#run the application:
#applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1 
srun -n 96 -c 32 --cpu_bind=cores -G 96 --gpu-bind=none vasp_std > log

However, the same issue occurred. Now I am trying to reduce the number of threads and nodes with:

Code: Select all

#!/bin/bash
#SBATCH -N 8
#SBATCH -C gpu
#SBATCH -G 32
#SBATCH -q regular
#SBATCH -t 24:00:00

#OpenMP settings:
export OMP_NUM_THREADS=2
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#run the application:
#applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1 
srun -n 32 -c 32 --cpu_bind=cores -G 32 --gpu-bind=none vasp_std > log

xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#6 Post by xiaoming_wang » Wed Dec 11, 2024 1:12 pm

With OMP_NUM_THREADS=2, I still got the same issue. With fewer nodes, I ran into a memory segmentation fault.


alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: Large BSE matrix diagonalization

#7 Post by alexey.tal » Wed Dec 11, 2024 3:26 pm

Dear Xiaoming Wang,

I have also looked at this issue, so let's see if I'm able to help you.
Unfortunately, we can't easily test this job because it requires over 1 TB of memory and we currently don't have a machine like that.

The errors in the OUTCAR show that the calculation breaks inside the eigensolver (PCHEEVX), more specifically in the reduction step (PCHENTRD).
It seems that you should have a sufficient amount of memory on your nodes to run this calculation.
However, in the error message we can see that a scratch array of size 1826745088 (complex) needs to be allocated. I suspect that this causes an integer(4) overflow when the array is allocated as a real array with double the length, which is often done internally in the ScaLAPACK routines.
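
(A quick sanity check of this conjecture, using only the numbers from the error message: 2 × 1826745088 = 3653490176, which exceeds the integer(4) maximum of 2147483647, so a 32-bit length for the real-valued view of that scratch array would indeed overflow.)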

Here are some things you could try:

  • increase the number of MPI ranks by a factor of 2 (or more), so that each rank allocates a smaller scratch array and the overflow is prevented
  • use OMEGAMAX to exclude transitions beyond the given energy range and thus reduce the rank of the BSE Hamiltonian
  • use the time-evolution BSE algorithm (IBSE=1), which doesn't use the eigensolver

Furthermore, I would recommend estimating the runtime of the eigensolver beforehand, as I suspect it will take much more than 24 hours on 96 MPI ranks. You have the results for the 9x9x9 k-points, so you can estimate how long the 10x10x10 calculation will take, considering that the eigensolver scales cubically with the matrix size.
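
(As a rough illustration, assuming purely cubic scaling and taking the rank of 384000 reported in the error message versus 9³ × 18 × 24 = 314928 for the 9x9x9 mesh, the factor is about (384000/314928)³ ≈ 1.8, i.e. roughly twice the 9x9x9 eigensolver time.)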

Regarding the GPU offloading: in VASP 6.4.3 the GPU support in BSE is limited, because the library for efficient matrix distribution (cuBLASMp) was released after VASP 6.4.3, and the performance without this library is rather poor.
So we recommend waiting for the release of VASP 6.5, where the cuBLASMp library is fully supported and one can expect good performance.
Furthermore, to solve a matrix of rank 384000, the GPU eigensolver (cuSOLVERMp) would require around 4-5 TB of memory on the GPUs.
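
(For orientation: a dense complex double-precision matrix of rank 384000 by itself occupies 384000² × 16 bytes ≈ 2.4 TB, before any eigensolver workspace, so the 4-5 TB figure is plausible.)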


xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#8 Post by xiaoming_wang » Fri Dec 13, 2024 10:36 pm

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

I have also looked at this issue, so let's see if I'm able to help you.

Thanks Alexey!

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

Unfortunately, we can't easily test this job because it requires over 1 TB of memory and we currently don't have a machine like that.

I assume this 1 TB of memory is distributed over the nodes, not required on a single node, right?

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

The errors in the OUTCAR show that the calculation breaks inside the eigensolver (PCHEEVX), more specifically in the reduction step (PCHENTRD).
It seems that you should have a sufficient amount of memory on your nodes to run this calculation.
However, in the error message we can see that a scratch array of size 1826745088 (complex) needs to be allocated. I suspect that this causes an integer(4) overflow when the array is allocated as a real array with double the length, which is often done internally in the ScaLAPACK routines.

Indeed, I'm confused about this based on my calculations. I tried a 9x9x9 k mesh with a matrix rank of 314928, and that run was successful. However, when I tried a 10x10x8 k grid with a smaller matrix rank of 307200 (I reduced NBANDSO from 18 to 16), the same issue as for 10x10x10 occurred. So the larger rank (314928) had no problem, while the smaller rank (307200) did.

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

Here are some things you could try:
- increase the number of MPI ranks by a factor of 2 (or more), so that each rank allocates a smaller scratch array and the overflow is prevented
- use OMEGAMAX to exclude transitions beyond the given energy range and thus reduce the rank of the BSE Hamiltonian
- use the time-evolution BSE algorithm (IBSE=1), which doesn't use the eigensolver

Thanks for your suggestions. I tried doubling the MPI ranks, but the problem persists. I believe that if it were a memory issue, doubling the MPI ranks should solve the problem, given that the ranks of 10x10x10 and 9x9x9 satisfy (432000/314928)^2 < 2. Am I missing something?
At this point, I need all the BSE eigenvectors for further processing, so I prefer to diagonalize the whole matrix rather than use OMEGAMAX or the time-evolution algorithm.

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

Furthermore, I would recommend estimating the runtime of the eigensolver beforehand, as I suspect it will take much more than 24 hours on 96 MPI ranks. You have the results for the 9x9x9 k-points, so you can estimate how long the 10x10x10 calculation will take, considering that the eigensolver scales cubically with the matrix size.

You are right. The estimated time for 10x10x10 is 25 hours.

alexey.tal wrote: Wed Dec 11, 2024 3:26 pm

Regarding the GPU offloading: in VASP 6.4.3 the GPU support in BSE is limited, because the library for efficient matrix distribution (cuBLASMp) was released after VASP 6.4.3, and the performance without this library is rather poor.
So we recommend waiting for the release of VASP 6.5, where the cuBLASMp library is fully supported and one can expect good performance.
Furthermore, to solve a matrix of rank 384000, the GPU eigensolver (cuSOLVERMp) would require around 4-5 TB of memory on the GPUs.

Thanks for this information.


alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: Large BSE matrix diagonalization

#9 Post by alexey.tal » Mon Dec 16, 2024 1:00 pm

I assume this 1 TB of memory is distributed over the nodes, not required on a single node, right?

That is correct. The memory is distributed. But we don't have an InfiniBand connection between the nodes, so to be able to test your job I need to fit it into 1 TB.

Indeed, I'm confused about this based on my calculations. I tried a 9x9x9 k mesh with a matrix rank of 314928, and that run was successful. However, when I tried a 10x10x8 k grid with a smaller matrix rank of 307200 (I reduced NBANDSO from 18 to 16), the same issue as for 10x10x10 occurred. So the larger rank (314928) had no problem, while the smaller rank (307200) did.

Could you please show the OUTCAR and stdout from these tests?

Thanks for your suggestions. I tried doubling the MPI ranks, but the problem persists. I believe that if it were a memory issue, doubling the MPI ranks should solve the problem, given that the ranks of 10x10x10 and 9x9x9 satisfy (432000/314928)^2 < 2. Am I missing something?

I agree with you that it is probably not a memory issue. My conjecture is that the dimensions of the scratch array exceed the integer(4) range, and that causes the issue.

The 10x10x8 calculation has a rank of 307200, and I was able to test it myself. If I run this calculation on a node with 1 TB and use 64 MPI ranks, it breaks with the error:
PCHENTRD parameter number 13 had an illegal value.
But if I run the same job on the same node with 128 MPI ranks, it doesn't break. I didn't run the full diagonalization, but considering that with 64 MPI ranks it breaks right away in the eigensolver, I think this should solve the problem.
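
(A plausible explanation, not a guarantee: in a block-cyclic distribution the per-rank ScaLAPACK workspace grows roughly as N²/P for a rank-N matrix on P ranks, so doubling the number of ranks roughly halves the local scratch array and can keep its real-valued length below the integer(4) limit.)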

At this point, I need all the BSE eigenvectors for further processing, so I prefer to diagonalize the whole matrix rather than use OMEGAMAX or the time-evolution algorithm.

Of course, if you would like to use the eigenvectors you need to use the exact diagonalization algorithm.

One thing I noticed in your INCAR: you use 16 occupied bands (NBANDSO = 16) and 24 empty bands (NBANDSV = 24), but I saw that you have 48 electrons in your system.


xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#10 Post by xiaoming_wang » Thu Dec 19, 2024 5:20 am

Please find the attached OUTCAR and log files.

alexey.tal wrote: Mon Dec 16, 2024 1:00 pm

The 10x10x8 calculation has a rank of 307200, and I was able to test it myself. If I run this calculation on a node with 1 TB and use 64 MPI ranks, it breaks with the error:
PCHENTRD parameter number 13 had an illegal value.

This is the same error as mine, as shown in the log file.

One thing I noticed in your INCAR: you use 16 occupied bands (NBANDSO = 16) and 24 empty bands (NBANDSV = 24), but I saw that you have 48 electrons in your system.

Yes, I excluded the bottom 8 bands, which were tested and show almost no contribution.


alexey.tal
Global Moderator
Posts: 319
Joined: Mon Sep 13, 2021 12:45 pm

Re: Large BSE matrix diagonalization

#11 Post by alexey.tal » Thu Dec 19, 2024 7:50 am

Thank you for providing the files.

This error is the same as mine as in the log file.

Exactly. But I also see that in your 10x10x8 calculation you used 48 MPI ranks, while in the 9x9x9 calculation you used 96 MPI ranks.
As I wrote above, I can reproduce this error for the 10x10x8 job if I run it on 64 MPI ranks, but it goes away if I use all 128 MPI ranks on the node.
Have you tried running the 10x10x8 or 10x10x10 jobs on a larger number of MPI ranks?
Furthermore, in your OUTCAR files I see that the number of threads per rank is also very large: 32.
Have you tried using more MPI ranks instead of these OpenMP threads?

Yes, I excluded the bottom 8 bands, which were tested and show almost no contribution.

That makes perfect sense.


xiaoming_wang
Jr. Member
Posts: 65
Joined: Tue Nov 12, 2019 4:34 am

Re: Large BSE matrix diagonalization

#12 Post by xiaoming_wang » Thu Dec 19, 2024 2:32 pm

Thanks!
You are right that I may need to go to more MPI ranks. The calculations I did used at most 96 MPI ranks because I only have 96 bands. So, to use more MPI ranks, I think I need to add more bands in the preceding DFT calculation.

