Can't open sockets for parallel cluster


Question

I am trying to use the parallel package, and found that makeCluster fails to complete. I've traced the hang to the following line in newPSOCKnode :

con <- socketConnection("localhost", port = port, server = TRUE, 
    blocking = TRUE, open = "a+b", timeout = timeout)

That command stalls (granted the default timeout is a large value). My suspicion is this is due to some "overzealous IT rules" laid down on our work computers, but would welcome any suggestions as to how to trace (and fix) the source of the problem. This is Windows7-64, "Enterprise", R 3.0.1 .

More info: inside a debugging session, I set timeout <- 10, but it still hangs, as though socketConnection is getting trapped somewhere that it can't even check the timeout value.

Here's my dump at the same point as Richie Cotton's data:

Browse[3]> ls.str()
arg :  chr "parallel:::.slaveRSOCK()"
cmd :  chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\" -e \"parallel:::.slaveRSOCK()\" MASTER=localhost PORT=11017 OUT="| __truncated__
env :  chr "MASTER=localhost PORT=11017 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE"
machine :  chr "localhost"
manual :  logi FALSE
master :  chr "localhost"
methods :  logi TRUE
options : <environment: 0x000000000ccac6a0> 
outfile :  chr "/dev/null"
port :  int 11017
rank :  int 1
renice :  int NA
rscript :  chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\""
timeout :  num 2592000
useXDR :  logi TRUE

So aside from a different port number, I think everything matches up.

Next trick: I opened a shell and ran netsh advfirewall firewall add rule name="Open Port 11017" dir=in action=allow protocol=TCP localport=11017 and got an "OK" response. I ran netstat -a -n and found the following line:

TCP 0.0.0.0:11017 0.0.0.0:0 LISTENING

But running makePSOCKcluster still hangs at the same place.
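One way to separate a firewall problem from a worker start-up problem is to check whether a single R session can connect to itself over loopback. The following is only a sketch: loopback_ok is a hypothetical helper, and it relies on serverSocket() and socketAccept(), which I believe are only available in R 4.0.0 and later, so it would not run on the R 3.0.1 installation described above.

```r
# Hypothetical helper: returns TRUE if this R session can listen on `port`
# and connect to itself via "localhost", FALSE otherwise.
# Assumes R >= 4.0.0 (serverSocket() and socketAccept()).
loopback_ok <- function(port, timeout = 5) {
  srv <- client <- accepted <- NULL
  # Close whatever was opened, in reverse order, when the function exits.
  on.exit(for (con in list(accepted, client, srv)) if (!is.null(con)) close(con))
  tryCatch({
    srv <- serverSocket(port)                       # start listening
    client <- socketConnection("localhost", port = port, blocking = TRUE,
                               open = "a+b", timeout = timeout)
    accepted <- socketAccept(srv, blocking = TRUE, open = "a+b",
                             timeout = timeout)     # accept our own client
    TRUE
  }, error = function(e) {
    message("loopback check failed: ", conditionMessage(e))
    FALSE
  })
}

loopback_ok(11017)
```

If this returns TRUE while makePSOCKcluster still hangs, the socket itself is probably not the culprit and the worker process is the more likely suspect.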

NEXT: I tried running R from the command line (via cygwin bash), and the error message I get is Error in loadhistory(file) : no history mechanism available Execution halted, after which Ctrl-C returns me to the R prompt.

Recommended answer

What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the "worker" processes that perform the actual work of the "cluster". That involves starting new R sessions with the Rscript command, each of which executes the .slaveRSOCK function. That function creates a socket connection back to the master and then executes the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it, even though that worker may have died or never been created successfully.

For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
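As a minimal sketch of the outfile approach (the log file name and its location in tempdir() are my own choices, not something from the question):

```r
library(parallel)

# Send the worker's output and error messages to a file we can inspect.
log_file <- file.path(tempdir(), "worker.log")

cl <- makePSOCKcluster(1, outfile = log_file)
pids <- clusterEvalQ(cl, Sys.getpid())   # trivial task to prove the worker responds
stopCluster(cl)

# Show whatever the worker wrote while starting up and running.
cat(readLines(log_file, warn = FALSE), sep = "\n")
```

If makePSOCKcluster hangs instead of returning, the lines after it never run, so the log file has to be opened from another session or a text editor.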

Here is an example:

> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE 

At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.

To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.
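Before stepping into the worker code, it can be worth confirming that the interactive session actually received the settings passed after --args. This is a rough sketch, assuming the settings arrive as NAME=value strings as in the command above; parse_worker_args is a hypothetical helper, not part of the parallel package.

```r
# Hypothetical helper: split NAME=value command-line arguments into a named list.
parse_worker_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  kv <- grep("=", args, fixed = TRUE, value = TRUE)  # keep only NAME=value items
  keys <- sub("=.*$", "", kv)                        # text before the first "="
  vals <- sub("^[^=]*=", "", kv)                     # text after the first "="
  stats::setNames(as.list(vals), keys)
}

# In the debugging session, call parse_worker_args() with no arguments.
# Here the arguments are supplied by hand for illustration.
str(parse_worker_args(c("MASTER=localhost", "PORT=10187", "OUT=log.txt")))
```

A missing or mistyped MASTER or PORT value at this point would explain a worker that never connects.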

For more information on this kind of problem, see my answer to a similar question.

Update

Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:

  • Check to see if anything in your .Rprofile only works in interactive mode
  • On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
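The loadhistory error quoted in the question is the kind of failure the first tip is about: workers are started with Rscript, which is not interactive. A minimal sketch of a guard, assuming the .Rprofile loads a history file (load_history_if_interactive and the ".Rhistory" file name are placeholders of mine):

```r
# Hypothetical .Rprofile fragment: do interactive-only setup only when the
# session really is interactive, so Rscript workers skip it.
load_history_if_interactive <- function(file = ".Rhistory") {
  if (!interactive()) {
    return(invisible(FALSE))   # non-interactive session: do nothing
  }
  if (file.exists(file)) {
    try(utils::loadhistory(file), silent = TRUE)
  }
  invisible(TRUE)
}

load_history_if_interactive()
```

The same if (interactive()) check can wrap any other start-up code that assumes a console is attached.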
