Can't open sockets for parallel cluster
Problem description
I am trying to use the parallel package, and found that makeCluster fails to complete. I've traced the hang to the following line in newPSOCKnode:
con <- socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE, open = "a+b", timeout = timeout)
That command stalls (granted, the default timeout is a large value). My suspicion is that this is due to some "overzealous IT rules" laid down on our work computers, but I would welcome any suggestions on how to trace (and fix) the source of the problem. This is Windows 7 64-bit "Enterprise", R 3.0.1.
More info: inside a debugging session, I set timeout <- 10, but it still hangs, as though socketConnection is getting trapped somewhere where it can't even check the timeout value.
Here's my dump at the same point as Richie Cotton's data:
Browse[3]> ls.str()
arg : chr "parallel:::.slaveRSOCK()"
cmd : chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\" -e \"parallel:::.slaveRSOCK()\" MASTER=localhost PORT=11017 OUT="| __truncated__
env : chr "MASTER=localhost PORT=11017 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE"
machine : chr "localhost"
manual : logi FALSE
master : chr "localhost"
methods : logi TRUE
options : <environment: 0x000000000ccac6a0>
outfile : chr "/dev/null"
port : int 11017
rank : int 1
renice : int NA
rscript : chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\""
timeout : num 2592000
useXDR : logi TRUE
So aside from a different port number, I think everything matches up.
Next trick: I opened a shell, ran netsh advfirewall firewall add rule name="Open Port 11017" dir=in action=allow protocol=TCP localport=11017 and got an "OK" response. I then ran netstat -a -n and found the following line:
TCP 0.0.0.0:11017 0.0.0.0:0 LISTENING
But running makePSOCKcluster still hangs at the same place.
Next: I tried running R from the command line (via cygwin bash), and the error message I get is
Error in loadhistory(file) : no history mechanism available
Execution halted
after which Ctrl-C returns me to the R prompt.
Recommended answer
What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the "worker" processes that perform the actual work of the cluster. That involves starting new R sessions with the Rscript command; each session executes the .slaveRSOCK function, which creates a socket connection back to the master and then runs the slaveLoop function, where it eventually executes the tasks the master sends it. If anything goes wrong while starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it, even though that worker may have died or never been created successfully.
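Since the first step is simply launching Rscript, it can help to confirm that step in isolation before looking at sockets. The following is a minimal sketch, assuming Rscript lives in R.home("bin") (as the command in the question suggests); the helper name rscript_starts is made up for illustration and is not part of the parallel package:

```r
# Hypothetical helper (not part of 'parallel'): checks that a fresh R
# process can be launched with Rscript, which is what the master does
# for every worker.
rscript_starts <- function() {
  rscript <- file.path(R.home("bin"), "Rscript")
  # --vanilla skips .Rprofile and saved workspaces, so only the launch
  # itself is being tested here.
  out <- tryCatch(
    system2(rscript, c("--vanilla", "-e", shQuote("cat(1 + 1)")),
            stdout = TRUE, stderr = TRUE),
    error = function(e) conditionMessage(e),
    warning = function(w) conditionMessage(w)
  )
  # The child process should print exactly "2" and exit cleanly.
  identical(out, "2")
}

rscript_starts()
```

If this returns FALSE, the worker is probably failing before it ever tries to connect, so the socket itself is unlikely to be the culprit.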
For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I switch to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever) and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return, since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window, and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command, because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.
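To check whether socket connections on localhost work at all, independently of parallel, one option is a loopback test inside a single R session. This is only a sketch under stated assumptions: it needs R >= 4.0.0 for serverSocket() and socketAccept() (so it would not run on the R 3.0.1 from the question), the helper name loopback_ok is hypothetical, and the default port is just the one from the question's dump:

```r
# Hypothetical helper: opens a listening socket and connects to it from the
# same R session, mimicking the master/worker handshake on one port.
# Assumes R >= 4.0.0 for serverSocket() and socketAccept().
loopback_ok <- function(port = 11017L, timeout = 5) {
  srv <- NULL; worker <- NULL; master <- NULL
  on.exit({
    # Close whatever was opened, even if the test failed part-way.
    for (con in list(worker, master, srv)) {
      if (!is.null(con)) try(close(con), silent = TRUE)
    }
  })
  tryCatch({
    srv <- serverSocket(port)                        # "master" side: listen
    worker <- socketConnection("localhost", port = port, blocking = TRUE,
                               open = "a+b", timeout = timeout)  # "worker" side
    master <- socketAccept(srv, blocking = TRUE, open = "a+b",
                           timeout = timeout)        # master accepts
    serialize("ping", worker)                        # worker -> master
    identical(unserialize(master), "ping")
  }, error = function(e) {
    message("loopback test failed: ", conditionMessage(e))
    FALSE
  })
}

loopback_ok()
```

A FALSE result (or a hang until the timeout) would suggest that something outside R, such as a firewall rule, is interfering with local connections; a TRUE result points back at the worker startup instead.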
For more information on this kind of problem, see my answer to a similar question.
Update
Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:
- Check to see if anything in your .Rprofile only works in interactive mode.
- On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output when using outfile=''.
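The first tip can be checked without a cluster by running Rscript non-interactively, once with the startup files and once without. This is a rough sketch, assuming Rscript is found in R.home("bin"); run_rscript is a hypothetical helper name, not an existing function:

```r
# Hypothetical helper: runs an expression in a non-interactive Rscript
# process, with or without the startup files, and returns the exit status.
run_rscript <- function(expr, vanilla = FALSE) {
  rscript <- file.path(R.home("bin"), "Rscript")
  args <- c(if (vanilla) "--vanilla", "-e", shQuote(expr))
  # stdout = "" and stderr = "" pass the child's output through to the
  # console, so an error raised by .Rprofile should be visible there.
  system2(rscript, args, stdout = "", stderr = "")
}

with_profile    <- run_rscript("invisible(interactive())")
without_profile <- run_rscript("invisible(interactive())", vanilla = TRUE)

# A non-zero status only in the first call points at the startup files.
c(with_profile = with_profile, without_profile = without_profile)
```

Workers are started without --vanilla, so a .Rprofile that calls something like loadhistory() would be expected to fail in the first call in the same way it fails for each worker.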