R parallel cluster worker process never returns


Question

I am using the doParallel package to parallelize jobs across multiple Linux machines using the following syntax:

library(doParallel)  # provides registerDoParallel(); attaches parallel for makePSOCKcluster()
cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = TRUE))
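
For context, the rest of a typical doParallel workflow around that call looks roughly like the sketch below; simulate_once() is a hypothetical stand-in for one small simulation, not code from the question:

registerDoParallel(cl)  # make the PSOCK cluster the backend for %dopar%

# Each foreach iteration becomes one task dispatched to a worker machine.
results <- foreach(i = 1:1000) %dopar% simulate_once(i)

stopCluster(cl)  # shut the worker processes down when finished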

Typically each job takes less than 10 minutes to run on a single machine. However, sometimes one worker process would 'run away' and keep running for hours without ever returning to the main driver process. I can see the process running using top, but it seems to be stuck somehow rather than doing real work. The outfile = '' option doesn't produce anything useful, since the worker process never actually fails.

This happens rather frequently but very randomly. Sometimes I can just restart the jobs and they finish fine, so I cannot provide a reproducible example. Does anyone have general suggestions on how to investigate this issue, or what to look for when it happens again in the future?

Adding more details in response to the comments. I am running thousands of small simulations on 10 machines. IO and memory usage are both minimal. I have noticed the worker process running away on different machines at random without any pattern, not necessarily the busiest ones. I don't have permission to view the system log, but based on CPU/RAM history there doesn't seem to be anything unusual.

It happens frequently enough that it's fairly easy to catch a runaway process in action. top shows the process running at close to 100% CPU in state R, so it is definitely running and not waiting. But I am also quite sure that each simulation should take only minutes, and somehow the runaway worker just keeps going non-stop.

So far doParallel is the only package I have tried. I am exploring other options, but it's hard to make an informed decision without knowing the cause.

Answer

This kind of problem is not uncommon on large compute clusters. Although the hung worker process may not produce any error message, you should check the system logs on the node where the worker was executed to see if any system problem has been reported. There could be disk or memory errors, or the system might have run low on memory. If a node is having problems, your problem could be solved by simply not using that node.
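
For instance, on a suspect node, the kernel log and system log are the usual first places to look. Log locations vary by distribution, so treat these commands as illustrative:

dmesg -T | grep -iE 'error|oom|fault' | tail    # recent kernel-level disk/memory complaints
grep -i 'oom-killer' /var/log/syslog | tail     # has the OOM killer fired? (path varies by distro)
free -h                                         # current memory headroom on the node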

This is one of the reasons that batch queueing systems are useful. Jobs that take too long can be killed and automatically resubmitted. Unfortunately, they often rerun the job on the same bad node, so it's important to detect bad nodes and prevent the scheduler from using them for subsequent jobs.
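
If no scheduler is available, a partial workaround is to enforce a per-task time limit from inside R using base R's setTimeLimit(). In this sketch, the 600-second limit and simulate_once() are illustrative assumptions:

run_with_limit <- function(i, limit_sec = 600) {
  # Raise an error if the task runs longer than limit_sec seconds of wall time.
  setTimeLimit(elapsed = limit_sec, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf), add = TRUE)  # clear the limit when the task ends
  tryCatch(simulate_once(i),
           error = function(e) list(task = i, error = conditionMessage(e)))
}

results <- foreach(i = 1:1000) %dopar% run_with_limit(i)

One caveat: setTimeLimit() can only fire when control returns to the R interpreter, so a worker spinning inside compiled code may still never be interrupted; this mitigates the symptom rather than explaining it.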

You might want to consider adding checkpointing capabilities to your program. Unfortunately, that is generally difficult, and especially difficult using the doParallel backend since there is no checkpointing capability in the parallel package. You might want to investigate the doRedis backend, since I believe the author was interested in supporting certain fault tolerance capabilities.
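
As a rough substitute for real checkpointing, each task can persist its result to disk and be skipped on a re-run. This sketch assumes the worker machines share a filesystem (e.g. NFS) and again uses the hypothetical simulate_once():

checkpoint_dir <- 'results'
dir.create(checkpoint_dir, showWarnings = FALSE)

run_checkpointed <- function(i) {
  f <- file.path(checkpoint_dir, sprintf('task_%05d.rds', i))
  if (file.exists(f)) return(readRDS(f))  # finished in an earlier run, so skip it
  res <- simulate_once(i)
  saveRDS(res, f)                         # persist so a restart can pick up from here
  res
}

results <- foreach(i = 1:1000) %dopar% run_checkpointed(i)

After killing a hung run, re-submitting the same job then recomputes only the tasks whose .rds files are missing.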

Finally, if you actually catch a hung worker in the act, you should get as much information about it as possible using "ps" or "top". The process state is important since that could help you to determine if the process is stuck trying to perform I/O, for example. Even better, you could try attaching gdb to it and get a traceback to determine what it is actually doing.
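
Concretely, once you have the PID of the runaway worker on its node, something like the following shows its state and a native stack trace (the PID 12345 is a placeholder):

ps -o pid,stat,wchan,etime,%cpu -p 12345     # stat 'R' = on CPU, 'D' = blocked on I/O
gdb -p 12345 -batch -ex 'bt' -ex 'detach'    # attach, print the C-level backtrace, detach

If repeated backtraces keep showing the same native routine, that is a strong hint about where the worker is spinning.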
