srun used in a loop: srun: Job step aborted: Waiting up to 32 seconds for job step to finish


Problem description

I have a .sh file that I run with srun because I want to see the script's output as it is printed. But when I run srun job_spinup.sh southfr_exp 1 & the job is always terminated after 2 main loops (cancelled due to the time limit). I want to run a 12-month model and loop over it 20 times (a so-called spin-up of 20 iterations), but the error occurs in November of the second loop. Here is the main code in job_spinup.sh:

#!/bin/bash
#SBATCH -J spinup
#SBATCH -p knl_cache
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH -o spinup.log
#SBATCH -e spinup.log
#=========================================================================
# USAGE
#   nohup ./job_spinup DOM[:EXP] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM] &
#
# by default: EXP=spinup, N=20, START_ID=0, START_MM=1
#=========================================================================
#set -x
#
if [ $# -lt 2 ]; then
  echo "Usage: $0 DOM[:EXP:VERSION] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM]"
  echo "DOM            = the name of a domain"
  echo "EXP            = the name of an experiment"
  echo "N              = the number of runnings"
  echo "START_ID       = start id of a running"
  echo "START_MM       = start month of a running"
  exit
fi

DOM=`echo $1 | awk '{split($1, f, ":"); print f[1]}'`
EXP=`echo $1 | awk '{split($1, f, ":"); print f[2]}'`
EXP=${EXP:-spinup}
VERSION=`echo $1 | awk '{split($1, f, ":"); print f[3]}'`
VERSION=${VERSION:--X0}
num_nodes=`echo ${2} | awk '{split($1, f, ":"); print f[1]}'`
tasks_per_node=`echo ${2} | awk '{split($1, f, ":"); print f[2]}'`
tasks_per_node=${tasks_per_node:-40}
tasks_for_trip=`echo ${2} | awk '{split($1, f, ":"); print f[3]}'`
tasks_for_trip=${tasks_for_trip:-1}
SPINUP_N=${3:-20}
START_ID=`echo $4 | awk '{split($1, f, ":"); print f[1]}'`
START_ID=${START_ID:-0}
START_MM=`echo $4 | awk '{split($1, f, ":"); print f[2]}'`
START_MM=${START_MM:-1}

# source ~/anaconda3/etc/profile.d/conda.sh
source $(conda info --base)/etc/profile.d/conda.sh
conda activate myenv
echo "***************************************"
echo " CONDA ENV ACTIVATED FOR NCO COMMAND"
echo "***************************************"
echo $SPINUP_N
#
# check if TRIP is used
LTRIP=`grep "LOASIS *= *T" OPTIONS/OPTIONS.nam | wc -l`
#
ulimit -s unlimited
ulimit -n 500000
ulimit -u 64000
unset I_MPI_PMI_LIBRARY
export OMP_NUM_THREADS=1
export DR_HOOK=0
export DR_HOOK_OPT=prof

...

YYYY=${YYYYMMDDHH::4}
MM=${YYYYMMDDHH:4:2}
j=$START_ID
while [ $j -lt $SPINUP_N ] ; do

  echo " "
  echo "------------------"
  echo "SPINUP : $j / $SPINUP_N"

  while [ $MM -le 12 ] ; do
    if [ $LTRIP -eq 1 ]; then
      mpirun -np $((SLURM_NTASKS - tasks_for_trip)) offline.exe : -np $tasks_for_trip trip.exe &> offline
    else
      #echo ${SLURM_NTASKS}
      #mpirun -np ${SLURM_NTASKS} offline.exe &> offline
      #srun -n 1 offline.exe &> offline
      offline.exe &> offline
    fi
....

# Change dates to start again
    if [ $MM -eq 12 ]; then
      ncap2 -O -s "'DTCUR-YEAR'=$YYYY;'DTCUR-MONTH'=1;'DTCUR-DAY'=1;'DTCUR-TIME'=0" PREP.nc PREP.nc
      [ $LTRIP -eq 1 ] && ncap2 -O -s "date(:)={$YYYY,1,1,0}" TRIP_PREP.nc TRIP_PREP.nc
    fi

...


  done

  echo '------------------'
  echo ' '

  MM=01
  j=$(( j+1 ))

done
...
# end simulation
date >> date_$EXP
echo "***************************************"
echo "   SPINUP ENDS CORRECTLY"
echo "***************************************"

conda deactivate
echo "***************************************"
echo "   CONDA ENV DEACTIVATED"
echo "***************************************"

The output looks like this:

(base) [xushan@int2 southfr_exp]$ srun job_spinup.sh southfr_exp 1 &
[1] 11570
(base) [xushan@int2 southfr_exp]$ srun: job 8860513 queued and waiting for resources
srun: job 8860513 has been allocated resources
***************************************
 CONDA ENV ACTIVATED FOR NCO COMMAND
***************************************
20
./job_spinup.sh: line 62: ulimit: open files: cannot modify limit: Operation not permitted
***************************************
   READY TO START SPINUP on tcn991.bullx
     spinup 20 0:1
***************************************
 
------------------
SPINUP : 0 / 20
    199601
1
    199602
1
    199603
1
    199604
1
    199605
1
    199606
1
    199607
1
    199608
1
    199609
1
    199610
1
    199611
1
    199612
1
------------------
 
 
------------------
SPINUP : 1 / 20
    199601
1
    199602
1
    199603
1
    199604
1
    199605
1
    199606
1
    199607
1
    199608
1
    199609
1
    199610
1
srun: Force Terminated job 8860513
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8860513.0 ON tcn991 CANCELLED AT 2020-09-07T12:51:24 DUE TO TIME LIMIT ***
srun: error: tcn991: task 0: Terminated
srun: Terminating job step 8860513.0

Can anyone help me? Thanks a lot! I am a Slurm beginner. Is it because I activated a conda environment? Also, squeue shows the job lasting only 5 minutes, and I have no idea why. Is it because of offline.exe?

Recommended answer

srun does not read job scripts the way sbatch does. This means that all of your #SBATCH options are ignored, including the time limit you set for the job. Your job therefore goes to the default partition with the default time limit, which seems to be only enough time for about two loops.
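To confirm which time limit the job actually received, one option is to ask squeue from inside the job. The following is a minimal sketch, not part of the original script: show_time_limit is a hypothetical helper name, and it assumes that squeue is on the PATH of the compute node and that Slurm has set SLURM_JOB_ID for the job.

```shell
#!/bin/bash
# Sketch: report the time limit Slurm applied to the current job.
# show_time_limit is a hypothetical helper, not a Slurm command.
show_time_limit() {
  # Outside a job (or without squeue) there is nothing to query.
  if [ -z "${SLURM_JOB_ID:-}" ] || ! command -v squeue >/dev/null 2>&1; then
    echo "not inside a Slurm job (or squeue not found)"
    return 0
  fi
  # -h drops the header line; %l is squeue's format field for the time limit.
  echo "time limit: $(squeue -h -j "$SLURM_JOB_ID" -o '%l')"
}

show_time_limit
```

Called near the top of job_spinup.sh, this would show whether the 10:00:00 requested in the header was applied or whether the partition default was used instead.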

There are several ways to solve this:

  1. Use sbatch and watch your output file (tail -f spinup.log)
  2. Use sbatch and attach to the job with sattach
  3. Pass the #SBATCH options as command-line parameters to srun
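For the third option, the #SBATCH lines can be copied onto the srun command line by hand or collected with a small helper. The sketch below is only an illustration under a few assumptions: sbatch_opts is a hypothetical function name, each directive sits on its own line in the short form used in the question, and the -o and -e options are skipped on purpose so that output still reaches the terminal instead of spinup.log. It only prints the command and submits nothing.

```shell
#!/bin/bash
# Sketch: turn the "#SBATCH" lines of a job script into srun flags.
# sbatch_opts is a hypothetical helper, not a Slurm tool.
sbatch_opts() {
  local script=$1
  # Keep the directive lines, strip the "#SBATCH" marker, drop the
  # output/error redirections, and join the rest into one line.
  grep '^#SBATCH' "$script" \
    | sed 's/^#SBATCH[[:space:]]*//' \
    | grep -v -e '^-o ' -e '^-e ' \
    | tr '\n' ' ' \
    | sed 's/ *$//'
}

# Example call (only runs if the job script is in the current directory):
if [ -f job_spinup.sh ]; then
  echo "srun $(sbatch_opts job_spinup.sh) job_spinup.sh southfr_exp 1"
fi
```

With the header shown in the question, the printed command should carry the flags -J spinup -p knl_cache -N 1 -n 1 -t 10:00:00, so the 10-hour limit is passed to srun explicitly. For the first option, the equivalent would be sbatch job_spinup.sh southfr_exp 1 followed by tail -f spinup.log.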
