Running MPI on two hosts

Question

I've looked through many examples and I'm still confused. I've compiled a simple latency check program from here, and it runs perfectly on one host, but it hangs when I try to run it on two hosts. However, running a command such as hostname works fine:

[hamiltont@4 latency]$ mpirun --report-bindings --hostfile hostfile --rankfile rankfile -np 2 hostname
[4:16622] [[5908,0],0] odls:default:fork binding child [[5908,1],0] to slot_list 0
4
[5:12661] [[5908,0],1] odls:default:fork binding child [[5908,1],1] to slot_list 0
5

But here is the compiled latency program:

[hamiltont@4 latency]$ mpirun --report-bindings --hostfile hostfile --rankfile rankfile -np 2 latency 
[4:16543] [[5989,0],0] odls:default:fork binding child [[5989,1],0] to slot_list 0
[5:12582] [[5989,0],1] odls:default:fork binding child [[5989,1],1] to slot_list 0
[4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)

My current guess is that there is something wrong with my firewall rules (e.g. hostname does not communicate between hosts, but the latency program does).

[hamiltont@4 latency]$ cat rankfile
rank 0=10.0.2.4 slot=0
rank 1=10.0.2.5 slot=0
[hamiltont@4 latency]$ cat hostfile 
10.0.2.4 slots=2
10.0.2.5 slots=2

Recommended answer

There are two kinds of communication involved in running an Open MPI job. First the job has to be launched. Open MPI uses a special framework to support many kinds of launches and you are probably using the rsh remote login launch mechanism over SSH. Obviously your firewall is correctly set up to allow SSH connections.

When an Open MPI job is launched and the processes are true MPI programs, they connect back to the mpirun process that spawned the job and learn all about the other processes in the job, most importantly the available network endpoints at each process. This message:

[4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)

indicates that the process which runs on host 4 is unable to open a TCP connection to the process which runs on host 5. The most common reason for that is the presence of a firewall, which limits the inbound connections. So checking your firewall is the first thing to do.
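
One way to test this hypothesis independently of MPI is to attempt a plain TCP connection from host 4 to a port on host 5. The Python sketch below is only an illustration: the function name can_connect is hypothetical, and it assumes that something is listening on the chosen port on the remote host (for example a temporary listener you start yourself). A "timeout" result suggests that packets are being dropped, which is consistent with the "Connection timed out (110)" error above, whereas "refused" means the host was reached but nothing is listening on that port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a single TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        # create_connection resolves the address and performs connect()
        # with the given timeout; the socket is closed on leaving the block.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote host answered, but no process listens on this port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout, typical of dropped packets.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or a name lookup failure.
        return f"error: {exc}"
```

For instance, calling can_connect("10.0.2.5", 5000) on host 4 would probe the second host from the hostfile; the port 5000 is an arbitrary example and not a port Open MPI is known to use.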

Another common cause is that both nodes have additional network interfaces that are configured and up, with compatible network addresses, but no connection can actually be established between them. This often happens on newer Linux setups, where various virtual and/or tunnelling interfaces are brought up by default. One can instruct Open MPI to skip those interfaces by listing them (either as interface names or as CIDR network addresses) in the btl_tcp_if_exclude MCA parameter, e.g.:

$ mpirun --mca btl_tcp_if_exclude "127.0.0.1/8,tun0" ...

(one always has to add the loopback interface when setting btl_tcp_if_exclude)

or one can explicitly specify which interfaces are to be used for communication by listing them in the btl_tcp_if_include MCA parameter:

$ mpirun --mca btl_tcp_if_include eth0 ...
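
To decide which interfaces to include or exclude, it can help to compare the addresses configured on both nodes. The sketch below uses Python's standard ipaddress module to list interface pairs that sit on the same subnet. The function name shared_networks is hypothetical, and the interface names and addresses in the example data are made up for illustration, not taken from the question. A shared subnet only makes an interface a candidate; whether traffic can really flow over it still has to be tested.

```python
import ipaddress


def shared_networks(host_a: dict, host_b: dict) -> list:
    """Return (iface_a, iface_b, network) for interfaces on the same subnet.

    Each argument maps an interface name to its address in CIDR form,
    e.g. {"eth0": "10.0.2.4/24"}.
    """
    matches = []
    for name_a, cidr_a in host_a.items():
        # ip_interface keeps the host bits; .network masks them off.
        net_a = ipaddress.ip_interface(cidr_a).network
        for name_b, cidr_b in host_b.items():
            net_b = ipaddress.ip_interface(cidr_b).network
            if net_a == net_b:
                matches.append((name_a, name_b, str(net_a)))
    return matches


# Hypothetical interface tables for the two hosts (illustration only).
host4 = {"lo": "127.0.0.1/8", "eth0": "10.0.2.4/24", "virbr0": "192.168.122.1/24"}
host5 = {"lo": "127.0.0.1/8", "eth0": "10.0.2.5/24", "virbr0": "192.168.122.1/24"}
candidates = shared_networks(host4, host5)
```

In this made-up example, virbr0 appears on the same subnet on both hosts even though the two bridges are not connected to each other, so it would be a natural entry for btl_tcp_if_exclude, while eth0 is the interface one would name in btl_tcp_if_include.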

Since the IP address in the error message matches the address of your second host in the hostfile, Open MPI is already using the intended interface, so the problem must come from an active firewall rule.
