SLURM slow for array job


Question

I have a small cluster with nodes A, B, C and D. Each node has 80 GB of RAM and 32 CPUs. I am using Slurm 17.11.7.

I ran the following benchmark tests:

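  • If I run a certain Java command directly in a terminal on node A, I get a result in 2 minutes.
  • If I run the same command as a "single" array job (#SBATCH --array=1-1), I again get a result in 2 minutes.
  • If I run the same command with the same parameters as part of a Slurm array job restricted to node A, I get the output in 8 minutes, that is, it is four times slower. Of course, 31 other Java commands with different parameters are running at the same time in this case.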

I already tried SelectTypeParameters=CR_CPU_Memory and SelectTypeParameters=CR_Core, with the same result.

Why is my array job 4 times slower? Thanks for your help!

The header of the array job I submit looks like this:

#!/bin/bash -l
#SBATCH --array=1-42
#SBATCH --job-name exp
#SBATCH --output logs/output_%A_%a.txt
#SBATCH --error logs/error_%A_%a.txt
#SBATCH --time=20:00
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH -w <NodeA>

The slurm.conf file looks like this:

ControlMachine=<NodeA>
ControlAddr=<IPNodeA>
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=<test_user_123>
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity

MaxJobCount=100000
MaxArraySize=15000

MinJobAge=300
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=Cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log

# COMPUTE NODES
#NodeName=<NameA-D> State=UNKNOWN
NodeName=<NameA> NodeAddr=<IPNodeA> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameB> NodeAddr=<IPNodeB> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameC> NodeAddr=<IPNodeC> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameD> NodeAddr=<IPNodeD> State=UNKNOWN CPUs=32 RealMemory=70363

PartitionName=debug Nodes=<NodeA-D> Default=YES MaxTime=INFINITE State=UP

Recommended answer

If the running time does not depend on the values of the parameters passed to the Java application, there are two possible explanations:

Either your cgroup configuration does not confine your jobs and your Java code is multithreaded. In that case, if you run only one job, or run directly on the node, your single task uses several CPUs in parallel. If you run a job array that saturates the node, each task can use only a single CPU.

Or your node is configured with hyper-threading. In that case, if you run only one job, or run directly on the node, your single task can use a full physical core. If you run a job array that saturates the node, each task must share a physical core with another task.
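
To tell the two explanations apart, it may help to print what each task actually sees. The following is a minimal sketch, not part of the original answer: the helper names count_cpus, threads_per_core and report_cpu_view are hypothetical, and it assumes a Linux node where /proc/self/status and lscpu (with English output) are available.

```shell
#!/bin/bash
# Hypothetical diagnostic helpers (names are illustrative, not Slurm tools).

# Count the CPUs in a Linux CPU list such as "0-3,8,10-11".
count_cpus() {
    local list="$1" total=0 part lo hi
    local IFS=','
    for part in $list; do            # unquoted on purpose: split on commas
        if [[ "$part" == *-* ]]; then
            lo="${part%-*}"
            hi="${part#*-}"
            total=$(( total + hi - lo + 1 ))
        else
            total=$(( total + 1 ))
        fi
    done
    echo "$total"
}

# Read lscpu output on stdin and print the "Thread(s) per core" value.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ {gsub(/[ \t]/, "", $2); print $2}'
}

# Print the CPUs this process may run on and the node's threads per core.
report_cpu_view() {
    local allowed
    allowed="$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)"
    echo "host=$(hostname) task=${SLURM_ARRAY_TASK_ID:-none}"
    echo "allowed CPUs: ${allowed} ($(count_cpus "$allowed") total)"
    echo "threads per core: $(lscpu | threads_per_core)"
}
```

Calling report_cpu_view at the top of the job script, before the Java command, and once in a plain terminal on node A lets you compare the two views. If a task submitted with --cpus-per-task=1 still reports many allowed CPUs, the task is probably not confined and a multithreaded JVM can spread across them. If the threads per core value is 2, the 32 CPUs Slurm sees are likely hardware threads on 16 physical cores.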
