srun used in a loop: srun: Job step aborted: Waiting up to 32 seconds for job step to finish
Problem description
I have a .sh file that I run with srun because I want to see the script's output as it is printed. But when I run srun job_spinup.sh southfr_exp 1 & the job always fails with a time-limit error after 2 main loops. The main code of the .sh file is shown below. I want to run a 12-month model and loop over it 20 times (a so-called spin-up of 20 iterations), but the error occurs in November of the second loop.
Here is the code in job_spinup.sh:
#!/bin/bash
#SBATCH -J spinup
#SBATCH -p knl_cache
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH -o spinup.log
#SBATCH -e spinup.log
#=========================================================================
# USAGE
# nohup ./job_spinup DOM[:EXP] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM] &
#
# by default: EXP=spinup, N=20, START_ID=0, START_MM=1
#=========================================================================
#set -x
#
if [ $# -lt 2 ]; then
echo "Usage: $0 DOM[:EXP:VERSION] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM]"
echo "DOM = the name of a domain"
echo "EXP = the name of an experiment"
echo "N = the number of runnings"
echo "START_ID = start id of a running"
echo "START_MM = start month of a running"
exit 1  # non-zero status: the script was called with too few arguments
fi
DOM=`echo $1 | awk '{split($1, f, ":"); print f[1]}'`
EXP=`echo $1 | awk '{split($1, f, ":"); print f[2]}'`
EXP=${EXP:-spinup}
VERSION=`echo $1 | awk '{split($1, f, ":"); print f[3]}'`
VERSION=${VERSION:--X0}
num_nodes=`echo ${2} | awk '{split($1, f, ":"); print f[1]}'`
tasks_per_node=`echo ${2} | awk '{split($1, f, ":"); print f[2]}'`
tasks_per_node=${tasks_per_node:-40}
tasks_for_trip=`echo ${2} | awk '{split($1, f, ":"); print f[3]}'`
tasks_for_trip=${tasks_for_trip:-1}
SPINUP_N=${3:-20}
START_ID=`echo $4 | awk '{split($1, f, ":"); print f[1]}'`
START_ID=${START_ID:-0}
START_MM=`echo $4 | awk '{split($1, f, ":"); print f[2]}'`
START_MM=${START_MM:-1}
# source ~/anaconda3/etc/profile.d/conda.sh
source $(conda info --base)/etc/profile.d/conda.sh
conda activate myenv
echo "***************************************"
echo " CONDA ENV ACTIVATED FOR NCO COMMAND"
echo "***************************************"
echo $SPINUP_N
#
# check if TRIP is used
LTRIP=`grep "LOASIS *= *T" OPTIONS/OPTIONS.nam | wc -l`
#
ulimit -s unlimited
ulimit -n 500000
ulimit -u 64000
unset I_MPI_PMI_LIBRARY
export OMP_NUM_THREADS=1
export DR_HOOK=0
export DR_HOOK_OPT=prof
...
YYYY=${YYYYMMDDHH::4}
MM=${YYYYMMDDHH:4:2}
j=$START_ID
while [ $j -lt $SPINUP_N ] ; do
echo " "
echo "------------------"
echo "SPINUP : $j / $SPINUP_N"
while [ $MM -le 12 ] ; do
if [ $LTRIP -eq 1 ]; then
mpirun -np $((SLURM_NTASKS - tasks_for_trip)) offline.exe : -np $tasks_for_trip trip.exe &> offline
else
#echo ${SLURM_NTASKS}
#mpirun -np ${SLURM_NTASKS} offline.exe &> offline
#srun -n 1 offline.exe &> offline
offline.exe &> offline
fi
....
# Change dates to start again
if [ $MM -eq 12 ]; then
ncap2 -O -s "'DTCUR-YEAR'=$YYYY;'DTCUR-MONTH'=1;'DTCUR-DAY'=1;'DTCUR-TIME'=0" PREP.nc PREP.nc
[ $LTRIP -eq 1 ] && ncap2 -O -s "date(:)={$YYYY,1,1,0}" TRIP_PREP.nc TRIP_PREP.nc
fi
...
done
echo '------------------'
echo ' '
MM=01
j=$(( j+1 ))
done
...
# end simulation
date >> date_$EXP
echo "***************************************"
echo " SPINUP ENDS CORRECTLY"
echo "***************************************"
conda deactivate
echo "***************************************"
echo " CONDA ENV DEACTIVATED"
echo "***************************************"
The output looks like this:
(base) [xushan@int2 southfr_exp]$ srun job_spinup.sh southfr_exp 1 &
[1] 11570
(base) [xushan@int2 southfr_exp]$ srun: job 8860513 queued and waiting for resources
srun: job 8860513 has been allocated resources
***************************************
CONDA ENV ACTIVATED FOR NCO COMMAND
***************************************
20
./job_spinup.sh: line 62: ulimit: open files: cannot modify limit: Operation not permitted
***************************************
READY TO START SPINUP on tcn991.bullx
spinup 20 0:1
***************************************
------------------
SPINUP : 0 / 20
199601
1
199602
1
199603
1
199604
1
199605
1
199606
1
199607
1
199608
1
199609
1
199610
1
199611
1
199612
1
------------------
------------------
SPINUP : 1 / 20
199601
1
199602
1
199603
1
199604
1
199605
1
199606
1
199607
1
199608
1
199609
1
199610
1
srun: Force Terminated job 8860513
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8860513.0 ON tcn991 CANCELLED AT 2020-09-07T12:51:24 DUE TO TIME LIMIT ***
srun: error: tcn991: task 0: Terminated
srun: Terminating job step 8860513.0
Can anyone help me? Thanks a lot! I am a Slurm beginner. Is it because I activated a conda environment? According to squeue, the job lasts only 5 minutes, and I have no idea why. Is it because of offline.exe?
Recommended answer
srun does not read job scripts the way sbatch does. This means that all your #SBATCH options are ignored, including the time limit you set for the job. Your job therefore goes to the default partition with the default time limit, which seems to be only long enough for two loops.
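One way to check this on a given cluster is to compare the partition defaults with the limit the job actually received. The following is a minimal sketch, not part of the original answer: the helper name show_time_limits is hypothetical, it assumes the standard scontrol and squeue client tools are on the PATH, and it prints a short message instead when they are not.

```shell
#!/bin/bash
# Hypothetical helper: show partition time limits and, if a job id is given,
# the limit and remaining time of that job.
show_time_limits() {
    local jobid="${1:-}"
    if ! command -v scontrol >/dev/null 2>&1; then
        echo "Slurm client tools not found on this host"
        return 0
    fi
    # Keep only the lines that name each partition and state its
    # Default flag, DefaultTime and MaxTime.
    scontrol show partition 2>&1 | grep -E 'PartitionName=|Default=|DefaultTime=|MaxTime='
    if [ -n "$jobid" ]; then
        # %i = job id, %P = partition, %l = time limit, %L = time left
        squeue -j "$jobid" -o "%i %P %l %L" 2>&1
    fi
}
```

Calling show_time_limits 8860513 while the job is running would show whether it received the 10:00:00 requested in the script or the partition default.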
There are several ways to work around this:
- Use sbatch and watch the output file (tail -f spinup.log)
- Use sbatch and attach to the job with sattach
- Add the #SBATCH options as parameters to srun
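The three options above could look roughly like this on the command line. This is a sketch under assumptions: the run and workaround helpers are hypothetical, JOBID is a placeholder for the id that sbatch prints, and the step id 0 passed to sattach is a guess that depends on how the script launches its steps. By default the commands are only printed; setting RUN=1 executes them.

```shell
#!/bin/bash
# Dry-run helper: prints each command; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Placeholder for the job id that sbatch prints on submission.
JOBID="${JOBID:-<jobid>}"

workaround() {
    case "$1" in
        1)  # sbatch parses the #SBATCH lines; follow the log file it writes
            run sbatch job_spinup.sh southfr_exp 1
            run tail -f spinup.log
            ;;
        2)  # sbatch, then attach to a running job step (step id 0 is an assumption)
            run sbatch job_spinup.sh southfr_exp 1
            run sattach "${JOBID}.0"
            ;;
        3)  # keep srun, but pass the options explicitly; -o/-e are left out
            # so that the output still goes to the terminal
            run srun -J spinup -p knl_cache -N 1 -n 1 -t 10:00:00 \
                job_spinup.sh southfr_exp 1
            ;;
        *)  echo "usage: workaround 1|2|3" >&2
            return 1
            ;;
    esac
}
```

For example, workaround 3 prints the srun command with the partition and time limit taken from the #SBATCH lines of job_spinup.sh.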