VASP 6.4.3 quits without Error Message

philip_wurzner
Newbie
Posts: 2
Joined: Wed Jun 19, 2024 1:49 pm

#1 Post by philip_wurzner » Wed Oct 30, 2024 12:49 pm

Hi VASP support,

Unfortunately, my build of VASP 6.4.3, running on a SLURM cluster, stops in the middle of an aiMD job without producing any error message.
I compiled VASP with oneAPI 2022.3 on Red Hat 8.1. VASP also completes its test suite successfully.

The simulation simply stops, seemingly at random. The last lines of the OSZICAR are:

   1786 T=  1036. E= -.23538280E+04 F= -.27677929E+04 E0= -.27677929E+04  EK= 0.11378E+02 SP= 0.40E+03 SK= 0.70E-01
   1787 T=  1033. E= -.23538280E+04 F= -.27679653E+04 E0= -.27679653E+04  EK= 0.11354E+02 SP= 0.40E+03 SK= 0.76E-01
   1788 T=  1030. E= -.23538280E+04 F= -.27681294E+04 E0= -.27681294E+04  EK= 0.11314E+02 SP= 0.40E+03 SK= 0.82E-01
   1789 T=  1024. E= -.23538279E+04 F= -.27682767E+04 E0= -.27682767E+04  EK= 0.11251E+02 SP= 0.40E+03 SK= 0.87E-01
   1790 T=  1016. E= -.23538276E+04 F= -.27683977E+04 E0= -.27683977E+04  EK= 0.11158E+02 SP= 0.40E+03 SK= 0.90E-01
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1    -0.273911763763E+04    0.12358E+01   -0.84342E+02  3104   0.434E+01
DAV:   2    -0.274133701322E+04   -0.22194E+01   -0.22171E+01  3784   0.556E+00
DAV:   3    -0.274139721292E+04   -0.60200E-01   -0.60192E-01  4608   0.812E-01
DAV:   4    -0.274139949566E+04   -0.22827E-02   -0.22827E-02  3976   0.148E-01    0.107E+01

The INCAR is:

PREC = normal #precision
ISIF = 2 #stress tensor and dof
ISYM   = 0 #no symmetry is used
EDIFF = 1e-4 #tolerance of electronic sc loop
NELM = 60 #maximum electron sc steps
NELMIN = 4 #minimum e sc steps
ALGO = N #optimisation algo
MAXMIX = 40
NSIM = 4 #number of bands that are optimised in parallel
LPLANE = T
LSCALU = F #whether to use scalapack decomposition
NWRITE = 1 #amount of info in outcar
LREAL = Auto
NBLOCK = 1 #how many steps until DOS etc is calculated
KBLOCK = 20
APACO = 20.00 #cutoff for PC function
ISMEAR = -1 #smearing of partially occupied orbitals
IBRION = 0 #MD
SMASS = 0 #Nose thermostat
SIGMA = 0.2064 #width of smearing
TEBEG = 1000 #starting temp
TEEND = 1100 #ending temp
NSW = 80000 #no of steps
POTIM = 0.4 #timesteps in fs
BMIX = 0.63
NCORE = 10
ML_LMLFF = .TRUE.
ML_MODE = train

KPOINTS:

KPOINTS
0
Auto
20

And finally my slurm submission file is:

#!/bin/bash

#SBATCH --job-name='ML ZrC 1000K'
#SBATCH --partition=compute
#SBATCH --time=120:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=10
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --account=<my account here>
#SBATCH --error=error.log

export PATH="/home/pdwurzner/software/vasp.6.4.3/bin:$PATH"

srun vasp_std

The error.log file is empty. The only information I have is from SLURM itself, which says the process crashed with an exit code of 134.
Also worth noting: I am running an MLFF simulation with roughly 85 atoms and about 160 GB of RAM.
I have tried to recompile VASP several times (also different versions) but the problem persists. How would you recommend I proceed?

Kind regards,

Philip
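
A note on the exit code of 134 reported by SLURM above: by shell convention, an exit status above 128 means the process was ended by a signal, and the signal number is the status minus 128. So 134 corresponds to signal 6, SIGABRT, which a program raises when it calls abort(). That suggests the job aborted itself rather than being killed from outside (an out-of-memory kill, for example, would usually show up as 137, i.e. SIGKILL). A minimal bash sketch for decoding such a status follows; the function name `signal_from_exit_code` is made up for illustration.

```shell
#!/bin/bash

# Translate an exit status into the name of the signal that ended the
# process, or "none" if the status does not indicate a signal.
# Hypothetical helper, not part of SLURM or VASP.
signal_from_exit_code() {
    local code=$1
    if [ "$code" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix.
        kill -l "$((code - 128))"
    else
        echo "none"
    fi
}
```

Calling `signal_from_exit_code 134` prints the signal name for the status SLURM reported here.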


alex
Hero Member
Posts: 591
Joined: Tue Nov 16, 2004 2:21 pm
License Nr.: 5-67
Location: Germany

Re: VASP 6.4.3 quits without Error Message

#2 Post by alex » Wed Oct 30, 2024 3:19 pm

Hi Philip,

first guess: memory limitations.

You are training with 80000 timesteps, which is IMHO far(!!!) too many.
Start with ~1000 and refine the potential if necessary. You'll see this if you check the sizes of the different errors.

Good luck!

alex
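
One way to look at the error sizes mentioned above is to pull the latest error line out of the ML_LOGFILE that the training run writes. The sketch below assumes that data lines in ML_LOGFILE begin with the tag `ERR` followed by the step number and the RMSEs of energy, force and stress, and that header lines begin with `#`; that layout should be checked against the header of your own file. The function name `last_mlff_errors` is hypothetical.

```shell
#!/bin/bash

# Print the most recent ERR line of an ML_LOGFILE.
# Assumption: data lines start with the tag "ERR", header lines with "#".
last_mlff_errors() {
    local logfile=${1:-ML_LOGFILE}
    # Remember every line whose first field is exactly "ERR"; print the last one.
    awk '$1 == "ERR" { last = $0 } END { if (last != "") print last }' "$logfile"
}
```

Running `last_mlff_errors` in the job directory every few hundred steps shows whether the errors are still shrinking or have levelled off, which is the point at which further training steps add little.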


philip_wurzner
Newbie
Posts: 2
Joined: Wed Jun 19, 2024 1:49 pm

Re: VASP 6.4.3 quits without Error Message

#3 Post by philip_wurzner » Fri Nov 01, 2024 9:59 am

Thanks for your reply, Alex.
I've figured out that my problem was actually a storage limit, which does not produce an error message when it terminates a program on my cluster.
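
For anyone who hits the same silent stop, a free-space check before `srun` in the submission script makes this failure mode visible. The following is a minimal sketch, not a complete safeguard: the function name `has_free_space` and any threshold you pass to it are assumptions, and `df` reports only how full the filesystem is. If the limit is a per-user quota, the cluster's own quota tool has to be queried instead.

```shell
#!/bin/bash

# Succeed if at least min_gib GiB are free on the filesystem holding dir.
# Hypothetical helper; pick the threshold from the expected output size.
has_free_space() {
    local dir=$1 min_gib=$2
    local free_kib
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte
    # blocks; field 4 of the second line is the available space.
    free_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kib" -ge "$((min_gib * 1024 * 1024))" ]
}

# Example use in a job script (threshold of 50 GiB is illustrative only):
# has_free_space "$PWD" 50 || { echo "Not enough free space in $PWD" >&2; exit 1; }
```

The check only covers the moment the job starts, so a long MLFF run can still fill the disk later. Repeating it periodically or watching the size of the output files covers that case.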

