R parallel cluster worker process never returns


Question

I am using the doParallel package to parallelize jobs across multiple Linux machines using the following syntax:

library(doParallel)  # provides registerDoParallel(); attaches parallel for makePSOCKcluster()
cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = TRUE))
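
For context, the rest of a typical doParallel workflow around that call looks roughly like the sketch below; simulate_once() is a hypothetical stand-in for one small simulation, not code from the question:

registerDoParallel(cl)  # make the PSOCK cluster the backend for %dopar%

# Each foreach iteration becomes one task dispatched to a worker machine.
results <- foreach(i = 1:1000) %dopar% simulate_once(i)

stopCluster(cl)  # shut the worker processes down when finished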

Typically each job takes less than 10 minutes to run on a single machine. However, sometimes one worker process would 'run away' and keep running for hours without ever returning to the main driver process. I can see the process running using top, but it seems to be stuck somehow rather than doing real work. The outfile = '' option doesn't produce anything useful, since the worker process never actually fails.

This happens rather frequently but very randomly. Sometimes I can just restart the jobs and they finish fine, so I cannot provide a reproducible example. Does anyone have general suggestions on how to investigate this issue, or what to look for when it happens again in the future?

Adding more details in response to the comments. I am running thousands of small simulations on 10 machines. IO and memory usage are both minimal. I have noticed the worker process running away on different machines at random without any pattern, not necessarily the busiest ones. I don't have permission to view the system log, but based on CPU/RAM history there doesn't seem to be anything unusual.

It happens frequently enough that it's fairly easy to catch a runaway process in action. top shows the process running at close to 100% CPU in state R, so it is definitely running and not waiting. But I am also quite sure that each simulation should take only minutes, and somehow the runaway worker just keeps going non-stop.

So far doParallel is the only package I have tried. I am exploring other options, but it's hard to make an informed decision without knowing the cause.

Answer

This kind of problem is not uncommon on large compute clusters. Although the hung worker process may not produce any error message, you should check the system logs on the node where the worker was executed to see if any system problem has been reported. There could be disk or memory errors, or the system might have run low on memory. If a node is having problems, your problem could be solved by simply not using that node.
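
For instance, on a suspect node, the kernel log and system log are the usual first places to look. Log locations vary by distribution, so treat these commands as illustrative:

dmesg -T | grep -iE 'error|oom|fault' | tail    # recent kernel-level disk/memory complaints
grep -i 'oom-killer' /var/log/syslog | tail     # has the OOM killer fired? (path varies by distro)
free -h                                         # current memory headroom on the node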

This is one of the reasons that batch queueing systems are useful. Jobs that take too long can be killed and automatically resubmitted. Unfortunately, they often rerun the job on the same bad node, so it's important to detect bad nodes and prevent the scheduler from using them for subsequent jobs.
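
If no scheduler is available, a partial workaround is to enforce a per-task time limit from inside R using base R's setTimeLimit(). In this sketch, the 600-second limit and simulate_once() are illustrative assumptions:

run_with_limit <- function(i, limit_sec = 600) {
  # Raise an error if the task runs longer than limit_sec seconds of wall time.
  setTimeLimit(elapsed = limit_sec, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf), add = TRUE)  # clear the limit when the task ends
  tryCatch(simulate_once(i),
           error = function(e) list(task = i, error = conditionMessage(e)))
}

results <- foreach(i = 1:1000) %dopar% run_with_limit(i)

One caveat: setTimeLimit() can only fire when control returns to the R interpreter, so a worker spinning inside compiled code may still never be interrupted; this mitigates the symptom rather than explaining it.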

You might want to consider adding checkpointing capabilities to your program. Unfortunately, that is generally difficult, and especially difficult using the doParallel backend since there is no checkpointing capability in the parallel package. You might want to investigate the doRedis backend, since I believe the author was interested in supporting certain fault tolerance capabilities.
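
As a rough substitute for real checkpointing, each task can persist its result to disk and be skipped on a re-run. This sketch assumes the worker machines share a filesystem (e.g. NFS) and again uses the hypothetical simulate_once():

checkpoint_dir <- 'results'
dir.create(checkpoint_dir, showWarnings = FALSE)

run_checkpointed <- function(i) {
  f <- file.path(checkpoint_dir, sprintf('task_%05d.rds', i))
  if (file.exists(f)) return(readRDS(f))  # finished in an earlier run, so skip it
  res <- simulate_once(i)
  saveRDS(res, f)                         # persist so a restart can pick up from here
  res
}

results <- foreach(i = 1:1000) %dopar% run_checkpointed(i)

After killing a hung run, re-submitting the same job then recomputes only the tasks whose .rds files are missing.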

Finally, if you actually catch a hung worker in the act, you should get as much information about it as possible using "ps" or "top". The process state is important since that could help you to determine if the process is stuck trying to perform I/O, for example. Even better, you could try attaching gdb to it and get a traceback to determine what it is actually doing.
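
Concretely, once you have the PID of the runaway worker on its node, something like the following shows its state and a native stack trace (the PID 12345 is a placeholder):

ps -o pid,stat,wchan,etime,%cpu -p 12345     # stat 'R' = on CPU, 'D' = blocked on I/O
gdb -p 12345 -batch -ex 'bt' -ex 'detach'    # attach, print the C-level backtrace, detach

If repeated backtraces keep showing the same native routine, that is a strong hint about where the worker is spinning.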
