Dear VASP community,
At the end of each VASP job, the timing and accounting information is printed in the OUTCAR. However, I cannot make any sense of the maximum memory used stated there. I never checked this for DFT calculations, as memory is usually not a bottleneck there, but I was interested in the values with regard to RPA calculations. The value stated did not seem to make any sense at all. For comparison, I used sacct to retrieve the MaxRSS value from the submission system. This value DID make sense, but did not correlate in any way with the maximum memory stated in the OUTCAR. Here are a few examples (all calculations were run on 128 CPUs):
        max_mem_from_OUTCAR   MaxRSS
calc1   1052332 kb            465217 K
calc2   1684456 kb            1512909 K
calc3   1831260 kb            1690932 K
calc4   2680032 kb            2054935 K
Is this a bug or a feature that I do not understand?
Thank you and best regards,
Katharina
timing and accounting information
Re: timing and accounting information
Dear Katharina,
Tracking memory exactly is quite hard, and Slurm (sacct) has its own problems catching memory usage spikes (as explained here).
Without more information, I assume you are comparing the entry "Maximum memory used (kB):" in the OUTCAR with the MaxRSS value of "sacct" from slurm.
Because the OUTCAR values are larger than the values reported by Slurm, I assume that the ones in the OUTCAR are more reliable. Slurm samples too slowly to capture all memory spikes.
Also, VASP always prints the MaxRSS of the master MPI rank, while Slurm prints the maximum MaxRSS over all MPI ranks.
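To put the two numbers side by side, something like the following could be used. This is only a sketch: the helper name outcar_maxmem_kb is made up for illustration, and it assumes the OUTCAR line has the form "Maximum memory used (kb):   <value>." (matched case-insensitively). The sacct call in the comment uses the standard MaxRSS and MaxRSSTask fields; the latter shows which task reached the maximum, which matters because the OUTCAR value refers to the master rank only.

```shell
#!/bin/bash
# Hypothetical helper: print the "Maximum memory used" value (in kB) found in an OUTCAR.
# Assumption: the line looks like "Maximum memory used (kb):   1052332."
outcar_maxmem_kb() {
    awk -F: 'tolower($0) ~ /maximum memory used/ { v = $2 + 0 }
             END { if (v != "") print v }' "$1"
}

# Possible usage (not executed here):
#   outcar_maxmem_kb OUTCAR
#   sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask
```

If the line occurs more than once in the file, the helper reports the last occurrence.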
One last remark: Rss is probably not the correct measure in practice, since the amount of shared memory is usually neglected.
The Pss (proportional set size) entry written by the Linux kernel to "/proc/<PID>/smaps" is probably the right measure in practice, but we do not support MaxPss currently.
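For a quick one-off check, the Pss of a single process can be summed directly from its smaps file. The snippet is a minimal sketch for Linux; pss_kb is a hypothetical helper name, and reading the smaps file of a process owned by another user requires the corresponding permissions.

```shell
#!/bin/bash
# Hypothetical helper: sum the "Pss:" entries of all mappings of one process, in kB.
# Only lines starting with "Pss:" are counted, so SwapPss and other Pss_* fields are ignored.
pss_kb() {
    awk '/^Pss:/ { sum += $2 } END { print sum + 0 }' "/proc/$1/smaps"
}

# Possible usage (not executed here): Pss of the current shell
#   pss_kb $$
```

This gives a single snapshot only; catching the maximum requires repeated sampling while the job runs.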
You might want to track the Pss interactively with the bash script "MonitorRAM.sh" given below.
Unfortunately, you will have to run this script on the compute nodes themselves, as "./MonitorRAM.sh <executable name>".
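#!/bin/bash
# Usage: ./MonitorRAM.sh <executable name> [<executable name> ...]
# Logs the Pss (in kB) of every matching process on this node as a function of time.
pagesize=`getconf PAGESIZE | awk '{print $1/1024}'`   # page size in kB (not used below)
for program in "$@" ; do
    # wait until at least one process of the program is running
    while true ; do
        pids=(`ps -A | grep $program | grep -v grep | grep -v $0 | grep -v mpi | awk '{print $1}'` )
        #pids=(`ps -u | grep $program | grep -v grep | grep -v $0 | grep -v mpi | awk '{print $2}'` )
        sleep 0.1
        if [ "${#pids[@]}" -gt "0" ] ; then
            break
        fi
    done
    echo ${pids[@]}
    t0=`cat /proc/uptime`
    out=m_$program.$HOSTNAME.dat
    if [ -f $out ] ; then
        rm $out
    fi
    # build the printf format, the header and the awk argument list (one Pss column per process)
    fmt=" %10.2f"
    header="# time[s]"
    args=",\$1-\$3"
    for i in `seq 1 ${#pids[@]} ` ; do
        l=$((i+4))   # fields 1-4 hold the two /proc/uptime readings, Pss values start at field 5
        fmt+=" %d"
        header+=" Pss[n=$i]"
        args+=", \$$l"
    done
    fmt+=" \n"
    echo $header >> $out
    # sample every 0.25 s until none of the processes exists any more
    while true ; do
        mem=()
        for pid in ${pids[@]} ; do
            if [ -f /proc/$pid/smaps ] ; then
                # count only the "Pss:" lines, not SwapPss or other Pss_* fields
                mem+=( `awk '/^Pss:/ {sum+=$2} END {print sum}' /proc/$pid/smaps` )
            fi
        done
        if [ "${#mem[@]}" -eq "0" ] ; then
            break
        fi
        # elapsed time since start ($1-$3), followed by the Pss of each process
        echo `cat /proc/uptime` $t0 ${mem[@]} | awk '{printf "'"$fmt"'" '"$args"' }' >> $out
        sleep 0.25
    done
done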
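Once the job has finished, the peak values can be pulled out of the resulting m_<program>.<hostname>.dat file. The following is a rough sketch rather than part of the script above; peak_pss is a hypothetical helper name, and it assumes the column layout written by MonitorRAM.sh (time in the first column, one Pss column in kB per process). Since Pss divides shared pages among the processes using them, the per-sample sum over all columns is a reasonable estimate of the memory the program occupies on that node.

```shell
#!/bin/bash
# Hypothetical helper: report the peak Pss per process and the peak of the
# per-sample sum over all processes from a MonitorRAM.sh output file.
peak_pss() {
    awk '!/^#/ && NF > 1 {
             total = 0
             for (i = 2; i <= NF; i++) {
                 if ($i + 0 > max[i] + 0) max[i] = $i + 0
                 total += $i
             }
             if (total > maxtotal) maxtotal = total
             if (NF > ncol) ncol = NF
         }
         END {
             for (i = 2; i <= ncol; i++)
                 printf "process %d peak Pss: %d kB\n", i - 1, max[i]
             printf "peak total Pss on this node: %d kB\n", maxtotal
         }' "$1"
}

# Possible usage (not executed here):
#   peak_pss m_vasp_std.$HOSTNAME.dat
```

The file name in the usage line is only an example; it depends on the executable name passed to MonitorRAM.sh.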
Re: timing and accounting information
Thank you for your answer!