-
renjie_chen1
- Newbie
- Posts: 10
- Joined: Fri Jul 29, 2022 2:40 am
#1
by renjie_chen1 » Fri Dec 13, 2024 4:17 am
Hi everyone,
I just ran into a memory problem when refitting an MLFF. I have a lot of configurations obtained from on-the-fly learning of different models (bulk, free surfaces, interfaces, etc.), which I combined by merging the different ML_ABNs. The number of configurations exceeds 10000. I managed to lower the memory requirement in 'select' mode by reducing ML_MCONF_NEW, although the run then takes much longer. However, when I tried to do the refit, I ran out of memory. Are there any parameters or methods that can be used to lower the memory consumption?
Thank you very much.
-
ferenc_karsai
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
#2
by ferenc_karsai » Fri Dec 13, 2024 10:35 am
The refit mode needs to allocate the design matrix twice, and one dimension of the design matrix depends heavily on the number of training structures. In terms of parameters you can't do much, but the good news is that the design matrix is fully distributed by ScaLAPACK, so you can use more nodes to fit it into memory.
-
#3
by renjie_chen1 » Fri Dec 13, 2024 12:42 pm
Hi Ferenc,
Thank you very much for the reply. I currently only have independent nodes with 192 cores each. Is there a training strategy that would, for example, split the task into several parts? I have been considering splitting the configurations into several groups with the merging script, training on each group, and then merging again. However, I found that the refit mode does not reduce the number of basis sets, so this strategy would not make sense :-(. Any suggestions?
Many thanks.
-
#4
by ferenc_karsai » Fri Dec 13, 2024 1:29 pm
Since you have to do the refit in one go with all the data, the only way to save memory is to reduce the number of training structures or the number of local reference configurations. In both cases you might lose accuracy.
You can decrease the number of local reference configurations by refitting with ML_MODE=refitbayesian and increasing ML_EPS_LOW, which enforces stronger sparsification. Alternatively, you can reselect the local reference configurations by setting ML_MODE=select and setting ML_MB to the maximum number of local reference configurations you want.
If you want to remove training structures, you have to do it manually and then run ML_MODE=select.
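As a rough illustration of the two options above, the following Python sketch renders the corresponding INCAR fragments. Only the tag names ML_MODE, ML_EPS_LOW and ML_MB come from this thread; the helper `render_incar`, the ML_LMLFF line, and the numeric values are assumptions chosen for illustration and need to be adapted to the actual setup.

```python
def render_incar(tags):
    """Render a dict of INCAR tags as 'TAG = value' lines (hypothetical helper)."""
    return "\n".join(f"{name} = {value}" for name, value in tags.items())


# Option 1: refit with stronger sparsification.
# The ML_EPS_LOW value is a placeholder, not a recommendation.
refit_sparser = {
    "ML_LMLFF": ".TRUE.",        # assumed: machine-learning mode switched on
    "ML_MODE": "refitbayesian",
    "ML_EPS_LOW": "1E-9",        # placeholder: larger value, stronger sparsification
}

# Option 2: reselect local reference configurations with a cap per species.
# The ML_MB value is a placeholder, not a recommendation.
reselect = {
    "ML_LMLFF": ".TRUE.",        # assumed: machine-learning mode switched on
    "ML_MODE": "select",
    "ML_MB": 2000,               # placeholder cap on local reference configurations
}

if __name__ == "__main__":
    print("# Option 1: sparser refit")
    print(render_incar(refit_sparser))
    print()
    print("# Option 2: reselection")
    print(render_incar(reselect))
```

Either fragment would be merged into the existing INCAR by hand; since both routes can cost accuracy, the resulting force field should be re-validated afterwards.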
-
#5
by renjie_chen1 » Sun Dec 15, 2024 6:22 am
Hi, Ferenc,
Thank you for the guidance. I tried reducing the number of training structures. There are now 2401 configurations, yet the refitting still stops with an out-of-memory error (my node has 773401 MB of memory). The header lines of my ML_AB look like this:
1.0 Version
**************************************************
The number of configurations
--------------------------------------------------
2401
**************************************************
The maximum number of atom type
--------------------------------------------------
3
**************************************************
The atom types in the data file
--------------------------------------------------
Nd Fe B
**************************************************
The maximum number of atoms per system
--------------------------------------------------
416
**************************************************
The maximum number of atoms per atom type
--------------------------------------------------
267
**************************************************
Reference atomic energy (eV)
--------------------------------------------------
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
**************************************************
Atomic mass
--------------------------------------------------
144.240000000000 55.8470000000000 10.8110000000000
**************************************************
The numbers of basis sets per atom type
--------------------------------------------------
3102 3102 3102
So what else should I watch out for regarding the memory issue? I used 72 cores for the refitting. Does the memory allocation increase with the number of processes used in the parallelization?
Thanks
-
#6
by ferenc_karsai » Mon Dec 16, 2024 10:26 am
You have many local reference configurations and many atoms per training structure which contribute strongly to the number of rows of the design matrix.
The size of the design matrix can be calculated as:
ML_MCONF x [3 x (number of atoms per training structure) + 1 + 6] x ML_MB x (number of species)
where
ML_MCONF = maximum number of training structures
ML_MB = maximum number of local reference configurations per species
The algorithm needs to allocate this design matrix twice.
The design matrix is fully distributed via ScaLAPACK, so it doesn't matter how many cores you use per node.
For the refitting, the only things you can do are to use more nodes or to reduce the size of the design matrix.
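To make the formula concrete, here is a minimal Python sketch that evaluates it with the values from the ML_AB header posted above (2401 structures, at most 416 atoms, 3102 basis sets per type, 3 species). The function names are hypothetical, and two inputs are assumptions rather than statements from this thread: 8 bytes per matrix element (double precision) and an even split of the matrix across nodes. The result covers only the two design-matrix copies, so it should be read as a lower bound on what the refit needs.

```python
def design_matrix_elements(ml_mconf, atoms_per_structure, ml_mb, n_species):
    """Element count from the formula in this thread."""
    # 3 * atoms + 1 + 6 is taken directly from the formula above.
    per_structure = 3 * atoms_per_structure + 1 + 6
    return ml_mconf * per_structure * ml_mb * n_species


def refit_bytes(elements, bytes_per_element=8, copies=2):
    """Memory for the design matrix, allocated twice during the refit.

    bytes_per_element=8 is an assumption (double precision).
    """
    return elements * bytes_per_element * copies


def per_node_bytes(total_bytes, n_nodes):
    """Share per node, assuming the matrix is split evenly across nodes."""
    return total_bytes / n_nodes


# Values read from the ML_AB header earlier in the thread.
elements = design_matrix_elements(
    ml_mconf=2401, atoms_per_structure=416, ml_mb=3102, n_species=3
)
total = refit_bytes(elements)

if __name__ == "__main__":
    print(f"design matrix elements: {elements:,}")
    print(f"two copies: {total / 1e9:.1f} GB")
    for n_nodes in (1, 2, 4):
        share = per_node_bytes(total, n_nodes)
        print(f"{n_nodes} node(s): {share / 1e9:.1f} GB per node")
```

With these inputs the formula gives roughly 28 billion elements, or about 449 GB for the two copies under the 8-byte assumption. ML_MCONF and ML_MB are maxima, so the values actually used in a run may be larger than the header numbers plugged in here, and other allocations come on top.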
-
#7
by renjie_chen1 » Mon Dec 16, 2024 10:33 am
Hi Ferenc,
Thank you very much for the explanation. That's clear.