Continuation of Machine Learning Jobs

julien_steffen
Newbie
Posts: 25
Joined: Wed Feb 23, 2022 10:18 am

#1 Post by julien_steffen » Tue May 10, 2022 4:07 pm

Dear VASP-Team,

I am running ML-FF generation calculations on different systems, some of which are rather large and require quite long times for single AIMD steps.

Since most of our available computing clusters usually have walltime-restrictions of one day, the problem often arises that I am only able to calculate the first, e.g., 500 steps of the ML process (with ML_ISTART = 0).
If I then do the straightforward thing and start a second calculation based on the ML_AB file of the first calculation with ML_ISTART = 1, the learning process needs essentially as many AIMD steps as before: the first 10-12 steps are always calculated with AIMD, and every 3rd to 5th step is calculated that way over the following hundreds of steps. I am quite puzzled by this, since the same dynamics done in a single run would of course converge successively, such that AIMD steps are required much less often after the first few hundred or thousand MD steps.
This behavior so far severely limits my ability to generate machine learning force fields for larger systems, since each 1-day calculation only covers around 500 MD steps, with a much too high ratio of AIMD to ML-FF steps and (presumably) a quite bloated training set covering only a small portion of configuration space, even if I repeat the process 10 times or so. Real convergence seems to be almost unreachable.

Is it possible to modify some of the input settings such that a calculation started with ML_ISTART = 1 behaves as if it were a direct continuation of a previous ML-FF generation calculation, with, for example, only every 10th to 50th MD step being calculated with DFT right from the beginning, depending on the progress made in the previous calculation(s)?

Thank you in advance,
Julien

andreas.singraber
Global Moderator
Posts: 250
Joined: Mon Apr 26, 2021 7:40 am

Re: Continuation of Machine Learning Jobs

#2 Post by andreas.singraber » Wed May 11, 2022 10:32 am

Dear Julien,

Thank you for your detailed problem description. The behavior you are observing is most likely due to the value of ML_CTIFOR, which is reset to its default each time you restart the MD run. This tag specifies the threshold for the Bayesian error estimate of the forces. If the threshold is exceeded, an ab initio calculation is triggered (the actual on-the-fly procedure is a bit more complex, but in essence that is how it works). The default value when you start a new MD run is very low, so the first few steps will basically always lead to an ab initio calculation. However, along the MD run the threshold value is repeatedly adapted, such that a reasonable amount of ab initio data is sampled.

The original idea behind ML_ISTART=1 was to allow the creation of a combined force field for multiple structures, e.g., for a crystal and the liquid of some material. In this case, resetting the value of ML_CTIFOR is actually desired, because one cannot expect a force field trained only on the crystal to immediately give good predictions for the liquid as well.

However, in your "continuation of an MD run" application of ML_ISTART=1 (which, by the way, is perfectly valid and a good thing to do), you need to prevent the resetting of ML_CTIFOR by manually setting it in the INCAR file to the last value of the previous run. This last value can be found in the ML_LOGFILE by searching for the THRUPD log lines like this:

Code:

grep THRUPD ML_LOGFILE

You should get something like this:

Code:

# THRUPD ####################################################################################
# THRUPD This line contains the new and old threshold for the maximum Bayesian error of forces.
# THRUPD 
# THRUPD nstep ......... MD time step or input structure counter
# THRUPD ctifor_prev ... Previous threshold for the maximum Bayesian error of forces (eV Angst^-1)
# THRUPD ctifor_new .... New threshold for the maximum Bayesian error of forces (eV Angst^-1)
# THRUPD std_sig ....... Standard deviation of the collected Bayesion errors of forces (eV Angst^-1)
# THRUPD slope_sig ..... Slope of the collected Bayesian errors of forces
# THRUPD ####################################################################################
# THRUPD            nstep      ctifor_prev       ctifor_new          std_sig        slope_sig
# THRUPD                2                3                4                5                6
# THRUPD ####################################################################################
THRUPD                 10   1.00000000E-16   1.14665296E-02   3.64338609E-01  -7.76056118E-02
THRUPD                 20   1.14665296E-02   1.06711749E-02   2.22730595E-01   8.99503708E-04
THRUPD                 48   1.06711749E-02   1.07495276E-02   2.37768936E-01   6.44653832E-02
THRUPD                 56   1.07495276E-02   1.17143839E-02   3.08669927E-01   9.42340928E-02
.....
THRUPD               8352   1.99912417E-02   2.01512448E-02   8.54340157E-02   2.32953915E-02
THRUPD               8578   2.01512448E-02   2.04824462E-02   8.42555829E-02   2.30467858E-02
THRUPD              10000   2.04824462E-02   2.06266607E-02   7.63899315E-02   1.27899696E-02

The fourth column, named ctifor_new, shows the threshold value along the trajectory. Now, just pick the last value and add it to your INCAR before you restart the MD run:

Code:

...
ML_ISTART = 1
ML_CTIFOR = 2.06266607E-02
...

I hope this solution works for you as expected. Remember to apply this procedure only if you simply need to continue the previous run. If the input structure changes, e.g., from crystal to liquid, do not set ML_CTIFOR manually.
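
Picking the last threshold by hand can also be scripted. The following is only a minimal sketch and not part of VASP: the function name last_ctifor is a hypothetical helper, and it assumes the THRUPD layout shown above, in which data lines begin with THRUPD, header lines begin with "# THRUPD", and ctifor_new is the fourth whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper (not a VASP tool): print the most recent ctifor_new
# value found in the ML_LOGFILE given as the first argument.
# Header lines are skipped because their first field is "#", not "THRUPD".
# Nothing is printed if the file contains no THRUPD data lines.
last_ctifor() {
    awk '$1 == "THRUPD" { value = $4 } END { if (value != "") print value }' "$1"
}

# Example usage (commented out so the file can be sourced without side effects):
# ctifor=$(last_ctifor ML_LOGFILE)
# [ -n "$ctifor" ] && printf 'ML_CTIFOR = %s\n' "$ctifor"
```

The commented usage lines print a ready-made INCAR line only when a value was actually found, so that an empty log does not produce an empty ML_CTIFOR entry.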

I am sorry that this has not yet been documented on the Wiki, but we are currently working on it and there will be some major improvements in the near future.

All the best,
Andreas Singraber


Re: Continuation of Machine Learning Jobs

#3 Post by julien_steffen » Fri May 13, 2022 8:05 am

Thank you very much for the detailed response! Now the following runs with ML_ISTART = 1 are indeed much faster.
I have only one follow-up question: when I restart the calculation with the manually given ML_CTIFOR value, no further ML_CTIFOR updates seem to be written to the ML_LOGFILE; "grep THRUPD ML_LOGFILE" prints only the header lines.
Am I right in assuming that the parameter is not updated during the run if its start value is given manually, or is it simply not written to file?
I have tested the procedure for a small and fast training system (a box of water molecules). There, the AIMD steps do indeed become scarcer from restart to restart (for 5 runs in total), even though no updated ML_CTIFOR values are written to file. It therefore seems to me that convergence can indeed be reached by this procedure.
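
Whether a log contains only the THRUPD header or also real threshold updates can be checked by counting the data lines. This is a minimal sketch: count_thrupd is a hypothetical helper name, and it assumes that data lines start with THRUPD in the first column while header lines start with "# THRUPD".

```shell
#!/bin/sh
# Hypothetical helper (not a VASP tool): count the threshold-update lines
# in the ML_LOGFILE given as the first argument.
# The ^THRUPD anchor excludes header lines, which begin with "# THRUPD".
# "n + 0" makes awk print 0 instead of an empty string when nothing matches.
count_thrupd() {
    awk '/^THRUPD/ { n++ } END { print n + 0 }' "$1"
}

# Example usage (commented out so the file can be sourced without side effects):
# count_thrupd ML_LOGFILE
```

A result of 0 means that no threshold update was logged in that run; it does not by itself show whether the update mechanism was enabled.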


Re: Continuation of Machine Learning Jobs

#4 Post by andreas.singraber » Fri May 13, 2022 9:01 am

Hello!

Actually, the automatic threshold update functionality is still enabled even if you manually set ML_CTIFOR; it can be disabled with the INCAR tag ML_ICRITERIA. The default is ML_ICRITERIA = 1, regardless of whether ML_CTIFOR is set manually. However, the threshold updates may simply become very rare for longer trajectories. Could you post your ML_LOGFILE from the first run and from the continuation run in which you no longer see any threshold updates, so that we can check whether everything works correctly? Thank you!

Best,
Andreas Singraber


Re: Continuation of Machine Learning Jobs

#5 Post by julien_steffen » Fri May 13, 2022 3:22 pm

Hi,
I have indeed also set ML_ICRITERIA = 1 manually, just as an additional safeguard, but this should of course make no difference.
Please find the first four ML_LOGFILE files in the attachment; the first is from the initial ML_ISTART = 0 calculation and the others are from the respective follow-up runs. One can see the convergence of the machine learning there quite nicely.
Best wishes,
Julien