slurm exceeded job memory limit with python multiprocessing
Question
I'm using slurm to manage some of our calculations, but sometimes jobs are getting killed with an out-of-memory error even though this should not be the case. This strange issue has come up in particular with Python jobs that use multiprocessing.
Here's a minimal example to reproduce this behavior:
#!/usr/bin/python
from time import sleep
from multiprocessing import Pool

nmem = int(3e7)  # this will amount to ~1 GB of numbers
nprocs = 200     # will create this many workers later
nsleep = 5       # sleep seconds

array = list(range(nmem))  # allocate some memory
print("done allocating memory")
sleep(nsleep)
print("continuing with multiple processes (" + str(nprocs) + ")")

def f(i):
    sleep(nsleep)

# this will create a pool of workers, each of which "seems" to use 1 GB
# even though the individual processes don't actually allocate any memory
p = Pool(nprocs)
p.map(f, list(range(nprocs)))
print("finished successfully")
Even though this may run fine locally, slurm memory accounting seems to sum up the resident memory of each of these processes, leading to a reported usage of nprocs x 1 GB rather than just 1 GB (the actual memory use). That's not what it should do, I think, and it's not what the OS reports either; the machine doesn't appear to be swapping or anything.
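The over-counting can be made visible directly. Below is a Linux-only sketch (the helper names and sizes are my own choosing) showing that each forked worker reports a resident set size close to the parent's, because copy-on-write pages inherited through fork are counted as resident in every child even though they occupy physical memory only once:

```python
import os
from multiprocessing import get_context

def child_rss_kb(_):
    # Report this worker's resident set size (VmRSS) from /proc.
    # Linux counts copy-on-write pages inherited from the parent as
    # resident in the child too, even though they share physical memory.
    with open("/proc/%d/status" % os.getpid()) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is in kB

def demo(nworkers=4):
    big = list(range(3_000_000))  # parent allocates roughly 100 MB
    # "fork" workers inherit the parent's address space via copy-on-write
    with get_context("fork").Pool(nworkers) as pool:
        rss = pool.map(child_rss_kb, range(nworkers))
    del big
    return rss

if __name__ == "__main__":
    # Each worker reports an RSS close to the parent's, so naively
    # summing per-process RSS yields roughly nworkers x the real
    # memory footprint.
    print(demo())
```

This is exactly the failure mode when an accounting mechanism adds up per-process RSS instead of asking the kernel how much memory the job as a whole holds.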
Here's the output if I run the code locally:
> python test-slurm-mem.py
done allocating memory
continuing with multiple processes (0)
finished successfully
And a screenshot of htop:
And here's the output if I run the same command using slurm:
> srun --nodelist=compute3 --mem=128G python test-slurm-mem.py
srun: job 694697 queued and waiting for resources
srun: job 694697 has been allocated resources
done allocating memory
continuing with multiple processes (200)
slurmstepd: Step 694697.0 exceeded memory limit (193419088 > 131968000), being killed
srun: Exceeded job memory limit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 694697.0 ON compute3 CANCELLED AT 2018-09-20T10:22:53 ***
srun: error: compute3: task 0: Killed
> sacct --format State,ExitCode,JobName,ReqCPUs,MaxRSS,AveCPU,Elapsed -j 694697.0
State ExitCode JobName ReqCPUS MaxRSS AveCPU Elapsed
---------- -------- ---------- -------- ---------- ---------- ----------
CANCELLED+ 0:9 python 2 193419088K 00:00:04 00:00:13
Answer
For others coming to this: as pointed out vaguely in the comments, you need to change the file slurm.conf. In this file you need to set the option JobAcctGatherType to jobacct_gather/cgroup (complete line: JobAcctGatherType=jobacct_gather/cgroup).
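For reference, a minimal sketch of the relevant fragment (the path is an assumption; adjust for your site):

```
# /etc/slurm/slurm.conf
# Gather accounting from the job's cgroup instead of summing
# per-process RSS, so copy-on-write pages aren't double-counted.
JobAcctGatherType=jobacct_gather/cgroup
```

After editing, the daemons typically need to be restarted for this parameter to take effect; you can verify the active value with `scontrol show config | grep JobAcctGather`.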
I previously had the option set to jobacct_gather/linux, which led to the wrong accounting values described in the question.