Can't open sockets for parallel cluster


Question

I am trying to use the parallel package, and found that makeCluster never completes. I've traced the hang to the following line in newPSOCKnode:

con <- socketConnection("localhost", port = port, server = TRUE, 
    blocking = TRUE, open = "a+b", timeout = timeout)

That command stalls (granted, the default timeout is a large value). My suspicion is that this is due to some overzealous IT rules laid down on our work computers, but I would welcome any suggestions on how to trace (and fix) the source of the problem. This is Windows 7 64-bit Enterprise, R 3.0.1.

More info: inside a debugging session, I set timeout <- 10, but it still hangs, as though socketConnection is getting trapped somewhere where it can't even check the timeout value.

Here's my dump at the same point as Richie Cotton's data:

Browse[3]> ls.str()
arg :  chr "parallel:::.slaveRSOCK()"
cmd :  chr ""C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript" -e "parallel:::.slaveRSOCK()" MASTER=localhost PORT=11017 OUT="| __truncated__
env :  chr "MASTER=localhost PORT=11017 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE"
machine :  chr "localhost"
manual :  logi FALSE
master :  chr "localhost"
methods :  logi TRUE
options : <environment: 0x000000000ccac6a0> 
outfile :  chr "/dev/null"
port :  int 11017
rank :  int 1
renice :  int NA
rscript :  chr ""C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript""
timeout :  num 2592000
useXDR :  logi TRUE

So aside from a different port number, I think everything matches up.

Next trick: I opened a shell and ran netsh advfirewall firewall add rule name="Open Port 11017" dir=in action=allow protocol=TCP localport=11017 and got an "OK" response. I then ran netstat -a -n and found the following line:

TCP 0.0.0.0:11017 0.0.0.0:0 LISTENING

But running makePSOCKcluster still hangs at the same place.

Next: I tried running R from the command line (via cygwin bash), and the error message I get is "Error in loadhistory(file) : no history mechanism available", followed by "Execution halted", after which Ctrl-C returns me to the R prompt.

Recommended answer

What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons, because it has to create all of the "worker" processes that perform the actual work of the cluster. That involves starting new R sessions with the Rscript command. Each session executes the .slaveRSOCK function, which creates a socket connection back to the master and then runs the slaveLoop function, where it eventually executes the tasks the master sends to it. If anything goes wrong while starting any of the worker processes (and trust me: a lot can go wrong), the master hangs in socketConnection, waiting for the worker to connect, even though that worker may have died or never been created successfully.

For many failure scenarios, the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
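As a minimal sketch of the outfile approach, the helpers below start a cluster whose workers write to a log file and print the tail of that log. The names start_logged_cluster and show_worker_log, and the file name worker.log, are hypothetical and chosen only for illustration; makePSOCKcluster and its outfile argument are the only parts taken from the parallel package.

```r
library(parallel)

# Start a PSOCK cluster whose workers send their output to 'log_file'.
# 'log_file' is a hypothetical name used only in this sketch.
start_logged_cluster <- function(n = 1L, log_file = "worker.log") {
  makePSOCKcluster(n, outfile = log_file)
}

# Return the last 'n' lines of the worker log, or an empty character
# vector if the worker never got far enough to create the file.
show_worker_log <- function(log_file = "worker.log", n = 20L) {
  if (!file.exists(log_file)) {
    return(character(0))
  }
  utils::tail(readLines(log_file, warn = FALSE), n)
}
```

If start_logged_cluster() hangs, the calling session is blocked, so show_worker_log() would have to be run from a second R session (or the file opened in an editor) to see what the worker reported before it died.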

Here's an example:

> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE 

At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever) and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return, since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window, and that clue will hopefully lead to a solution. If you're not so lucky, the Rscript command will also hang, and you'll have to dig even deeper.

To debug the worker, you don't execute the displayed Rscript command, because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang while trying to create the socket connection, in which case the title of your question will be appropriate.
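Before stepping through the worker, it may help to confirm that the interactive session actually received the KEY=VALUE settings passed after --args. The helper below is a hypothetical illustration, not part of the parallel package; it only sketches the kind of KEY=VALUE parsing the worker needs to do on its command-line arguments.

```r
# Hypothetical helper: split "KEY=VALUE" strings into a named list.
# Arguments without an "=" are ignored.
parse_worker_args <- function(args) {
  pos <- regexpr("=", args, fixed = TRUE)
  keep <- pos > 0
  values <- substring(args[keep], pos[keep] + 1L)
  names(values) <- substring(args[keep], 1L, pos[keep] - 1L)
  as.list(values)
}

# The settings from the manual-mode example above.
args <- c("MASTER=localhost", "PORT=10187", "OUT=log.txt",
          "TIMEOUT=2592000", "METHODS=TRUE", "XDR=TRUE")
cfg <- parse_worker_args(args)
str(cfg)
```

In the interactive worker session, parse_worker_args(commandArgs(trailingOnly = TRUE)) would show the values that were supplied on the command line. All values come back as character strings, so a port would need as.integer() before use.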

For more information on this kind of problem, see my answer to a similar question.

Update

Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:

  • Check whether anything in your .Rprofile only works in interactive mode.
  • On Windows, use the Rterm command rather than Rgui, so that you're more likely to see error messages and output when using outfile=''.
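The first tip can be sketched as a guard in .Rprofile. This is a hypothetical fragment, assuming the profile loads command history (as the loadhistory error in the question suggests); the function name setup_history and the default file name are illustrative only.

```r
# Hypothetical .Rprofile fragment: run interactive-only setup only when
# the session is interactive, so Rscript workers skip it.
setup_history <- function(hist_file = ".Rhistory",
                          is_interactive = interactive()) {
  if (!is_interactive || !file.exists(hist_file)) {
    return(invisible(FALSE))
  }
  # utils is not attached yet while .Rprofile runs, so qualify the call.
  result <- try(utils::loadhistory(hist_file), silent = TRUE)
  invisible(!inherits(result, "try-error"))
}

setup_history()
```

Because a worker started with Rscript is not interactive, the guard returns early there instead of stopping the worker with an error.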
