R: making cluster in doParallel / snowfall hangs

Question

I've got two servers on a LAN with fresh installs of Centos 6.4 minimal and R 3.0.1. Both computers have doParallel, snow, and snowfall packages installed.

The servers can ssh to each other fine.

When I attempt to make clusters in either direction, I get a prompt for a password, but after entering the password, it just hangs there indefinitely.

makePSOCKcluster("192.168.1.1",user="username")

How can I resolve this?

I also tried calling makePSOCKcluster on the above-mentioned computer with a host that IS capable of being used as a slave (from other computers), but it still hangs. So, is it possible there is a firewall issue? I also tried using makePSOCKcluster with port 22:

> makePSOCKcluster("192.168.1.1",user="username",port=22)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  cannot open the connection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  port 22 cannot be opened

Here's my iptables configuration:

# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT

Recommended answer

You could start by setting the "outfile" option to an empty string when creating the cluster object:

makePSOCKcluster("192.168.1.1",user="username",outfile="")

This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:

makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)

This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.

If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
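As a sketch of that firewall change: assuming the master's rules live in /etc/sysconfig/iptables as in the question, and that 11234 (an arbitrary illustrative value) is the port passed to makePSOCKcluster, a rule like the following could be added on the master, above the final INPUT REJECT line:

```text
-A INPUT -m state --state NEW -m tcp -p tcp --dport 11234 -j ACCEPT
```

The rule takes effect once the firewall is reloaded, for example with "service iptables restart" on CentOS 6. Note that the port argument is the port the master listens on for connections from its workers, not the ssh port, which is why port=22 fails in the question with "port 22 cannot be opened".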

To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on the local machine that should allow incoming connections:

> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE 

Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:

node03$ nc node01 11234

The master process should immediately return with the message:

socket cluster with 1 nodes on host ‘node03’

while netcat should display no message, since it is quietly reading from the socket connection.

However, if netcat displays the message:

nc: getaddrinfo: Name or service not known

then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).

If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's return value to verify that it was an error:

node03$ echo $?
1
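The two failure modes can be told apart from netcat's exit status and how long it took to return. The following is only a minimal sketch and not part of the original answer: classify_probe is a hypothetical helper, and the 10-second cutoff between "immediately" and "after a minute or two" is an arbitrary assumption.

```shell
# Hypothetical helper: interpret the outcome of a netcat probe.
#   $1 = netcat's exit status
#   $2 = seconds that passed before netcat returned
# The 10-second cutoff is an assumption chosen for illustration.
classify_probe() {
    status=$1
    elapsed=$2
    if [ "$status" -eq 0 ]; then
        echo "connected: no error reported"
    elif [ "$elapsed" -lt 10 ]; then
        echo "failed quickly: probably could not connect to the port"
    else
        echo "failed slowly: probably could not reach the host at all"
    fi
}

# Example use on node03 (left as comments because it needs the real hosts):
#   start=$(date +%s)
#   nc node01 11234
#   status=$?
#   classify_probe "$status" $(( $(date +%s) - start ))
```

A quick failure together with the REJECT rule shown in the question would point at the firewall on the master, while a slow failure suggests looking at routing or at a firewall that silently drops packets.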

Hopefully that will give you enough information about the problem that you can get help from a network administrator.
