Hi,
I am trying to fit an MLFF using VASP and wanted to check that my installation works, so I tried an example. I have attached the INCAR, OUTCAR, POSCAR, ML_LOGFILE and the stdout here (also the ICONST, since I want to sample a liquid phase at high T).
If I remove the ML tags and run just an MD, everything is fine. In fact, that is how I zeroed in on the MD hyperparameters (LANGEVIN_GAMMA etc.). However, when I start the ML training, it gets stuck after the first set of electronic steps converges (as you can see from the output files). It stayed like that for about 6 hours before I canceled it.
I wonder what I'm doing wrong. Perhaps the problem is the installation itself? Thank you for your kind help.
Best
Sayan
MLFF training stuck after first ionic step
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
MLFF training stuck after first ionic step
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
Re: MLFF training stuck after first ionic step
Sorry, I saw that I should also post the KPOINTS and the jobscript. I compiled VASP on SDSC PSC Bridges https://www.psc.edu/resources/bridges-2/
KPOINTS
Si
0 0 0
Gamma
4 4 4
0 0 0
jobscript
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -p RM
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=120
ulimit -s unlimited
module load intel intelmpi cuda hdf5 # same ones with which it was compiled
export OMP_NUM_THREADS=1
mpirun vasp.6.3.2/bin/vasp_std > vasp.out
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF training stuck after first ionic step
I just ran your calculation and it ran without any problems. I also tried it with 8 and 128 cores, and it ran fine.
So it is most likely a problem with your installation.
Try the following:
-) Compile without scaLAPACK (remove -DscaLAPACK from your CPP_OPTIONS in the makefile.include).
-) Compile without shared memory (remove -Duse_shmem from CPP_OPTIONS).
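The two edits above could be sketched roughly as follows. This is only an illustration, not an official procedure: the helper name strip_cpp_flags is made up, and it assumes both flags appear literally in makefile.include. It keeps a backup copy and only removes the preprocessor flags; the link line is left untouched.

```shell
#!/bin/bash
# Hypothetical helper: drop -DscaLAPACK and -Duse_shmem from a makefile.include.
# A backup is written to <file>.bak first. Assumes the flags appear literally.
strip_cpp_flags() {
    local mf="$1"
    cp "$mf" "$mf.bak"
    # Remove each flag only when it is followed by whitespace or end of line,
    # so that longer flag names sharing the same prefix are left alone.
    sed -E \
        -e 's/-DscaLAPACK([[:space:]]|$)/\1/g' \
        -e 's/-Duse_shmem([[:space:]]|$)/\1/g' \
        "$mf.bak" > "$mf"
}

# Usage (the path is an assumption):
#   strip_cpp_flags makefile.include
```

After changing CPP_OPTIONS, a clean rebuild is needed so that all objects are recompiled with the new options.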
You used 240 cores in your calculation.
Don't use so many. It's enough to try it with 8 cores.
I also saw that you have TEBEG=1800 and TEEND=800 in your calculation. Never run cooling runs in on-the-fly machine learning. Always use heating runs. Otherwise the automatic threshold determination can get stuck. This is also explained on our best practices wiki page:
wiki/index.php/Best_practices_for_machi ... rce_fields
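A quick pre-flight check for the temperature ramp could look like the sketch below. It is only an illustration: the function names incar_tag and ramp_direction are made up, it assumes TEBEG and TEEND each sit on their own line in the form "TAG = value", and it treats a missing TEEND as equal to TEBEG.

```shell
#!/bin/bash
# Hypothetical helpers: report whether an INCAR describes a heating or cooling ramp.
# Assumes one tag per line ("TAG = value"); several tags on one line are not handled.

incar_tag() {
    # Print the value of tag $2 in file $1 (first match), ignoring # and ! comments.
    sed -e 's/[#!].*//' "$1" |
        awk -F= -v tag="$2" '
            { key = $1; gsub(/[ \t]/, "", key) }
            toupper(key) == tag && !found { val = $2; gsub(/[ \t]/, "", val); found = 1 }
            END { print val }'
}

ramp_direction() {
    local tebeg teend
    tebeg=$(incar_tag "$1" TEBEG)
    teend=$(incar_tag "$1" TEEND)
    # TEEND not set: treat the run as constant temperature (assumption of this sketch).
    teend=${teend:-$tebeg}
    awk -v b="$tebeg" -v e="$teend" 'BEGIN {
        if (e + 0 < b + 0) print "cooling"
        else if (e + 0 > b + 0) print "heating"
        else print "constant"
    }'
}

# Usage (the path is an assumption):
#   ramp_direction INCAR
```

With TEBEG=1800 and TEEND=800 this reports "cooling", which is the case to avoid during on-the-fly training.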
-
- Newbie
- Posts: 6
- Joined: Sun Oct 16, 2022 9:49 pm
Re: MLFF training stuck after first ionic step
Hi,
Possibly the issue was either the libbeef installation or that I was running out of stack size. But now it is fixed. I also fixed the other issue you pointed out (it was just a check to see if ML training worked).
Best
Sayan
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF training stuck after first ionic step
Good to hear everything works now. Thanks for sharing your solution.