MLFF Stucking in Learning
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
MLFF Stucking in Learning
Dear all,
I am testing MLFF, which has worked well so far and has collected 1183 structures. However, when I continue learning for an additional 3000 steps with POTIM = 0.5 fs, the run gets stuck at the 3000th step. The algorithm does not trigger any learning step during the first 2999 steps, which take 1.5 h. The remaining 22.5 h are spent in the last step, and the job is then cancelled because of the time limit of the cluster.
I suspect that the number of reference configurations (~10000) is too large to fit, but I do not know exactly why the last step could not be completed in 22.5 h. There is 750 GB x 4 of memory allocated to this job. I would appreciate any help. The files are attached.
Regards,
Burak
-
- Global Moderator
- Posts: 249
- Joined: Mon Apr 26, 2021 7:40 am
Re: MLFF Stucking in Learning
Dear Burak,
Indeed, it seems there is an issue here. The initial learning in step 0 was successful, so we can assume that there was enough memory in the beginning. During the MD run, two additional configurations were collected in the "threshold" steps and kept as candidates for fitting. Because the buffer for new configurations (by default ML_MCONF_NEW = 5) was not filled until the very end of the trajectory, training was only triggered at the end. Usually, if many configurations are added during MD, there can be memory issues at the training step (lazy allocations), but in your case only very little extra space was used. So we would expect the final training to work correctly as well.
To further investigate this issue please send us all the files we would need to start the training, i.e., additionally the POSCAR, KPOINTS and POTCAR files. Also, please add the OUTCAR, OSZICAR and job submission script as listed in our forum posting guidelines. If possible, try to run the exact same simulation again so we can be sure it is reproducible on your side.
All the best,
Andreas Singraber and Ferenc Karsai
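The buffering behaviour described in the post above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not VASP's implementation: the function name `training_trigger_steps`, the example step numbers, and the rule that leftover candidates are flushed at the last MD step are illustrative assumptions based on the description above. Only the default buffer size of 5 (ML_MCONF_NEW) comes from the thread.

```python
def training_trigger_steps(candidate_steps, n_steps, buffer_size=5):
    """Toy model: return the MD steps at which a training would be triggered.

    candidate_steps: MD steps at which a configuration is collected as a
                     fitting candidate (hypothetical input, not VASP output).
    n_steps:         total number of MD steps in the run (NSW).
    buffer_size:     number of buffered candidates that triggers training
                     (stands in for ML_MCONF_NEW, default 5).
    """
    candidates = set(candidate_steps)
    buffered = 0
    triggers = []
    for step in range(1, n_steps + 1):
        if step in candidates:
            buffered += 1
        if buffered >= buffer_size:
            # Buffer is full: train now and start collecting again.
            triggers.append(step)
            buffered = 0
    if buffered > 0:
        # Assumption: candidates left in the buffer are fitted at the end.
        triggers.append(n_steps)
    return triggers


# Two candidates in a 3000-step run never fill a buffer of 5, so the only
# training in this toy model happens at the final step.
print(training_trigger_steps({1200, 2500}, n_steps=3000))
```

With only two candidates collected, the model reproduces the reported pattern: no training during the first 2999 steps and a single training at step 3000. The step numbers 1200 and 2500 are made up for the example.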
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Dear Andreas and Ferenc,
Thanks for the answer, and sorry for the incomplete file set. You can find the requested files via
https://www.dropbox.com/scl/fo/ubdrs42b ... zjycq&dl=0
This is a continuing learning run, and the problematic step was the 25th, corresponding to the 250000th step, hence it can take time to reproduce everything. However, I ran this problematic step more than once and the result was the same.
Regards,
Burak
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Dear Andreas and Ferenc,
I would like to follow up on this issue. Do you have any updates? I still have this problem.
Best Regards,
Burak
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF Stucking in Learning
I had to suddenly take over for Andreas.
I cannot access the uploaded files. Please always upload the small files here so that they can always be downloaded. Larger files can additionally be uploaded to an external platform.
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Sorry, my Dropbox was out of sync. Here are the files:
https://www.dropbox.com/scl/fo/ubdrs42b ... dy34y&dl=0
Regards,
Burak
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF Stucking in Learning
I took the ML_AB, POSCAR and INCAR files that you uploaded and set NSW=3000 in the INCAR to continue training for an additional 3000 steps.
As in your case, learning was only done in the 3000th step, but for me everything works fine. On 64 cores of a newer AMD Zen processor it took 415 seconds.
So it must be a problem with the VASP installation on your side. I see you are using VASP.6.4.1.
Please download the latest version, VASP.6.4.2, and maybe try different compilers.
Also, for the moment, try to run on the same number of cores that I used, which is 64, and use the same INCAR file that I used here:
SYSTEM = Naphtalene
ISYM = 0 ! no symmetry imposed
! ab initio
PREC = A
IVDW = 2
ALGO = FAST
ISMEAR = 0
SIGMA = 0.04 ! smearing in eV
ENCUT = 1000
EDIFF = 1e-6
NBANDS = 320
LWAVE = F
LCHARG = F
LREAL = F
! MD
IBRION = 0 ! MD (treat ionic degrees of freedom)
NSW = 3000 ! no of ionic steps
POTIM = 0.5 ! MD time step in fs
MDALGO = 4
NHC_NCHAINS = 4
TEBEG = 295 ! temperature
ISIF = 2 ! update positions, no cell shape and volume
! machine learning
ML_LMLFF = T
ML_MODE = train
ML_WTSIF = 2
ML_IALGO_LINREG=1
ML_SION1=0.3
ML_MRB2=12
# LPLANE = .TRUE. ! if NGZ = 3*(number of cores)/NPAR = 3*NCORE
NCORE = 4
KPAR = 2
ML_CTIFOR = 1.94395421E-02
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Dear Ferenc,
Thanks for your help. I followed your steps, and it indeed worked initially with 64 CPUs, NCORE=4 and KPAR=2. I then changed the number of cores to 72, with NCORE=12 and KPAR=2, to speed up the calculations and continue the learning. This worked for 30 ps of learning runs (10 x 3 ps). However, the run got stuck again in the next 3 ps continuation run. Even continuing from the last ML_AB for one more time step got stuck. I was able to run this last step successfully when I used your suggested settings of 64 CPUs, NCORE=4 and KPAR=2.
I am a bit puzzled here, as changing the number of cores worked at first but then the run got stuck again. If it were a compiler issue, I would expect consistent behavior. Do you have an idea why this problem can arise when using more than one node and appear suddenly at certain stages? I have also observed the same problem in other simulations. I also tried version 6.4.2, which did not resolve the issue. Note also that with 64 cores NBANDS is updated, which may change something.
You can reach the files via the links:
The stuck step: https://www.dropbox.com/scl/fo/nfekxt5c ... z6nr8&dl=0
The single step continuing from the last stuck step: https://www.dropbox.com/scl/fo/qh2e7f8u ... z4b6j&dl=0
Rerun the stuck step with 64 cores: https://www.dropbox.com/scl/fo/7joq3nu4 ... qyldt&dl=0
Regards,
Burak
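To compare the two parallel setups mentioned in the post above (64 cores with NCORE=4, KPAR=2 versus 72 cores with NCORE=12, KPAR=2), a short sketch of the resulting process layout may help. It assumes the commonly documented relations that the cores are split evenly over KPAR k-point groups, that NPAR equals the cores per k-point group divided by NCORE, and that NBANDS is raised to the next multiple of NPAR. The helper name `parallel_layout` is hypothetical, and the values VASP actually uses should be checked in the OUTCAR.

```python
def parallel_layout(n_cores, kpar, ncore, nbands):
    """Sketch of how cores might be distributed for given KPAR and NCORE.

    Assumptions (verify against the OUTCAR of the actual run):
      * n_cores must be divisible by KPAR,
      * cores per k-point group must be divisible by NCORE,
      * NPAR = cores per k-point group / NCORE,
      * NBANDS is rounded up to a multiple of NPAR.
    """
    if n_cores % kpar != 0:
        raise ValueError("number of cores is not divisible by KPAR")
    cores_per_kpoint_group = n_cores // kpar
    if cores_per_kpoint_group % ncore != 0:
        raise ValueError("cores per k-point group is not divisible by NCORE")
    npar = cores_per_kpoint_group // ncore
    # Round NBANDS up to the next multiple of NPAR (ceiling division).
    nbands_adjusted = -(-nbands // npar) * npar
    return {
        "cores_per_kpoint_group": cores_per_kpoint_group,
        "npar": npar,
        "nbands": nbands_adjusted,
    }


# The two setups from the thread, with NBANDS = 320 as in the posted INCAR.
for cores, ncore in [(64, 4), (72, 12)]:
    print(cores, ncore, parallel_layout(cores, kpar=2, ncore=ncore, nbands=320))
```

If the two layouts give different effective NBANDS values, the runs are not strictly identical, which is worth keeping in mind when comparing a hanging run with a working one.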
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF Stucking in Learning
Ok, as I understand from your post, everything that is on one node works, but as soon as you go to multiple nodes it hangs.
Do you use "-Duse_shmem" alone or do you also use "-Dsysv" in your compilation?
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Dear Ferenc,
I have not compiled it myself. Could you let me know how I can check this?
I also tried two nodes with 64 cores each, and so far it works. Moreover, version 6.4.2 also seems to work so far, but it may still fail, as I have observed before.
Regards,
Burak
-
- Jr. Member
- Posts: 51
- Joined: Thu Apr 06, 2023 12:25 pm
Re: MLFF Stucking in Learning
Just one quick update: version 6.4.2 also got stuck on another run.
Regards,
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF Stucking in Learning
OK, I have run your calculation with 64 and 72 cores and it finishes fine. I ran with the AOCC compiler using System V shared memory (-Duse_shmem and -Dsysv). I am not going to test all toolchains now because the calculation is quite time-consuming.
Could you find out the toolchain with which it fails for you? It could be a problem with your compilation, but it could also be a bug that shows up only with a specific compiler (I have had that in the past).
Also, as I already wrote, please ask whether -Dsysv was used in the compilation. Try to compile with the opposite setting: if -Dsysv was used, recompile without it and rerun the calculation, and vice versa. Shared memory can sometimes also be a source of errors. You cannot run without it, because on so many cores the job is too big to fit into memory otherwise.
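One possible way to check which shared-memory flags were used, assuming the makefile.include of the build can be obtained from whoever compiled VASP, is to scan it for precompiler options. The sketch below is not an official tool: the helper names `precompiler_flags` and `shared_memory_setting` and the sample file content are hypothetical, and only the flag names -Duse_shmem and -Dsysv are taken from the posts above.

```python
import re


def precompiler_flags(makefile_text):
    """Collect the names of all -D<name> options found in the given text."""
    flags = set()
    for line in makefile_text.splitlines():
        # Ignore everything after a '#' comment marker.
        line = line.split("#", 1)[0]
        flags.update(re.findall(r"-D(\w+)", line))
    return flags


def shared_memory_setting(flags):
    """Summarise the shared-memory related flags in a short string."""
    if "use_shmem" in flags and "sysv" in flags:
        return "-Duse_shmem with -Dsysv"
    if "use_shmem" in flags:
        return "-Duse_shmem without -Dsysv"
    return "no -Duse_shmem found"


# Hypothetical excerpt of a makefile.include, for illustration only.
SAMPLE = r"""
CPP_OPTIONS = -DHOST=\"LinuxGNU\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -Dscalapack \
              -Duse_shmem \
              -Dsysv   # System V shared memory
"""

print(shared_memory_setting(precompiler_flags(SAMPLE)))
```

In practice, the text would be read from the real file, for example with `open("makefile.include").read()`, where the path depends on the local installation.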