Hello,
I have two questions:
1. In the literature, some authors use on-the-fly training to equilibrate the structure, e.g. at 500 K, and then feed the CONTCAR from this run in as the POSCAR for the actual training at 500 K. I assume this is done to keep the force field light by showing it only the most probable structures for the given training conditions. How does omitting this step affect the quality of the resulting force field?
2. I assume that having equilibration as part of the training increases the number of structures in the training set and expands the number of local configurations beyond the ones that are most relevant for the system. Sparsification of local configurations with the ML_EPS_LOW or ML_MB tag is a partial fix for this issue. Is there something similar for the structures? To make this clearer: I am training a force field to study a lightly Al-doped system. I have trained on the undoped structure at 3 temperatures for 45 ps in total and accumulated ~1200 structures in my training set. After continuing the training on the doped system at the highest training temperature (my POSCAR for this training was not equilibrated), I have 1800 structures. My energy errors are on the order of 10 meV/atom, and my force errors are between 0.5 and 1 eV/Å. I want to continue the training, but I am afraid that having a lot of structures in the training set will later reduce the speed. Is there a way to remove the structures that contribute little to the accuracy, to lighten the force field before training it further?
Thank you very much!
Sona
sparsifying MLFF structures
Re: sparsifying MLFF structures
Dear Sona,
Question 1) Picking up structures during the equilibration can be useful for collecting additional structures at the conditions (temperature, pressure, volume) at which your simulation will be run, so that structures of this phase are incorporated in your training set. It can also be very useful to collect structures at a temperature higher than that of your production run, because your reference configurations will then cover a larger part of the phase space.
Question 2) It is true that the more local reference configurations you have in your force field, the slower it will get. There is currently no way to automatically reduce the number of structures in your reference data set. In principle this is not necessary, though, since only the number of local reference configurations influences the speed of your force field.
To reduce the number of local reference configurations to a minimum, I would create a test set at the conditions at which you want to do your production run. You can use this test set for a test set error analysis. A description of the test set error analysis can be found on our Best practices for machine-learned force fields site under the header Testing. With this test set, I would do a scan over ML_EPS_LOW and check how strongly you can reduce the number of local reference configurations while keeping an acceptable test set error. The training set error, which can be found in the ML_LOGFILE, might also serve as an error estimator, but the safest way is to use the test set error.
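The bookkeeping for such a scan could be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the helper names `rmse` and `pick_smallest_acceptable` are hypothetical, the predicted and reference values are assumed to have been extracted from your own test set runs already, and the error threshold is something you have to choose yourself.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences of numbers
    (e.g. force components in eV/Angstrom or energies per atom in eV)."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("predicted and reference must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


def pick_smallest_acceptable(scan, max_error):
    """Pick the scan point with the fewest local reference configurations
    whose test set error is still at or below max_error.

    scan: list of (ml_eps_low, n_local_refs, test_error) tuples, one per
          refit or reselection run (hypothetical layout, filled in by you).
    Returns the chosen tuple, or None if no run meets the threshold.
    """
    acceptable = [row for row in scan if row[2] <= max_error]
    if not acceptable:
        return None
    return min(acceptable, key=lambda row: row[1])
```

For each ML_EPS_LOW value you would compute the test set error with `rmse`, collect the results in a list, and let `pick_smallest_acceptable` return the lightest force field that still satisfies your accuracy requirement.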
Another possibility would be to set the maximum number of local reference configurations, ML_MB, in the INCAR file and run the machine-learned force field with ML_MODE=SELECT. In this mode, the algorithm reselects local reference configurations from the ML_AB file that you supply. Here, too, you could do a scan over ML_MB and check with an error analysis how strongly you can reduce the number of local reference configurations.
There is also a section about Performance and memory consumption on our Best practices for machine-learned force fields page, which might be helpful for you.
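A scan over ML_MB in this mode could be prepared with a small Python helper like the one below. It is a sketch only: the function names `select_incar` and `scan_inputs`, the directory naming scheme, and any ML_MB values you pass in are assumptions for illustration. It generates just the machine-learning tags discussed here; the other INCAR settings and input files of your calculation, including the ML_AB file, still have to be provided by you.

```python
def select_incar(ml_mb, ml_eps_low=None):
    """Return the machine-learning part of an INCAR for a reselection run.

    ml_mb:      maximum number of local reference configurations (ML_MB).
    ml_eps_low: optional sparsification threshold (ML_EPS_LOW); if None,
                the tag is omitted and the VASP default applies.
    """
    lines = [
        "ML_LMLFF = .TRUE.",
        "ML_MODE = select",
        f"ML_MB = {ml_mb}",
    ]
    if ml_eps_low is not None:
        lines.append(f"ML_EPS_LOW = {ml_eps_low:.0E}")
    return "\n".join(lines) + "\n"


def scan_inputs(ml_mb_values):
    """Map a (hypothetical) run directory name to the INCAR fragment
    for each ML_MB value in the scan."""
    return {f"select_mb_{mb}": select_incar(mb) for mb in ml_mb_values}
```

Each entry returned by `scan_inputs` can be written to its own run directory; after the runs finish, the resulting force fields can be compared with the same test set error analysis.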
I hope this clarifies your questions.
All the best
Jonathan