Unable to run MPI when transferring large data
Question
I used MPI_Isend to transfer an array of chars to the slave node. When the size of the array is small it works, but when I enlarge the array, the program hangs.
Code running on the master node (rank 0):
MPI_Send(&text_length, 1, MPI_INT, dest, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
MPI_Isend(text->chars, 360358, MPI_CHAR, dest, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
Code running on the slave node (rank 1):
MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD, &status);
MPI_Irecv(host_read_string, count, MPI_CHAR, 0, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
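For reference, the two snippets above can be combined into a minimal self-contained sketch of the length-then-payload handshake. The tag values and buffer contents here are placeholders (the question does not show them); build with mpicc and run with mpiexec -n 2:

```c
/* Minimal sketch of the question's handshake: rank 0 first sends the
   length, then the char payload; rank 1 receives both. Tag values are
   made up for illustration. */
#include <mpi.h>
#include <stdlib.h>

#define MSG_TEXT_LENGTH 1
#define MSG_SEND_STRING 2

int main(int argc, char **argv)
{
    int rank, text_length = 360358;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        char *chars = calloc(text_length, 1);
        /* Tell rank 1 how many chars follow, then send them. */
        MPI_Send(&text_length, 1, MPI_INT, 1, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
        MPI_Isend(chars, text_length, MPI_CHAR, 1, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        free(chars);
    } else if (rank == 1) {
        int count;
        MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD, &status);
        char *buf = malloc(count);
        MPI_Irecv(buf, count, MPI_CHAR, 0, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```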
You can see that the count parameter in MPI_Isend is 360358. It seemed too large for MPI: when I set the parameter to 1024 instead, it worked well.
Actually this problem has confused me for a few days. I know that there is a limit on the size of data transferred by MPI, but as far as I know MPI_Send is used to send short messages while MPI_Isend can send larger messages. That is why I used MPI_Isend.
The network configuration on rank 0 is:
[12t2007@comp01-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A5
inet addr:192.168.0.101 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:393267 errors:0 dropped:0 overruns:0 frame:0
TX packets:396421 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:35556328 (33.9 MiB) TX bytes:79580008 (75.8 MiB)
eth0.2002 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A5
inet addr:10.111.2.36 Bcast:10.111.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:133577 errors:0 dropped:0 overruns:0 frame:0
TX packets:127677 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:14182652 (13.5 MiB) TX bytes:17504189 (16.6 MiB)
eth1 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A4
inet addr:192.168.1.101 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:206981 errors:0 dropped:0 overruns:0 frame:0
TX packets:303185 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:168952610 (161.1 MiB) TX bytes:271792020 (259.2 MiB)
eth2 Link encap:Ethernet HWaddr 00:25:90:91:6B:56
inet addr:10.111.1.36 Bcast:10.111.1.255 Mask:255.255.254.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26459977 errors:0 dropped:0 overruns:0 frame:0
TX packets:15700862 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12533940345 (11.6 GiB) TX bytes:2078001873 (1.9 GiB)
Memory:fb120000-fb140000
eth3 Link encap:Ethernet HWaddr 00:25:90:91:6B:57
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fb100000-fb120000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1894012 errors:0 dropped:0 overruns:0 frame:0
TX packets:1894012 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:154962344 (147.7 MiB) TX bytes:154962344 (147.7 MiB)
The network configuration on rank 1 is:
[12t2007@comp02-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5F
inet addr:192.168.0.102 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:328449 errors:0 dropped:0 overruns:0 frame:0
TX packets:278631 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:47679329 (45.4 MiB) TX bytes:39326294 (37.5 MiB)
eth0.2002 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5F
inet addr:10.111.2.37 Bcast:10.111.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:94126 errors:0 dropped:0 overruns:0 frame:0
TX packets:53782 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8313498 (7.9 MiB) TX bytes:6929260 (6.6 MiB)
eth1 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5E
inet addr:192.168.1.102 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:121527 errors:0 dropped:0 overruns:0 frame:0
TX packets:41865 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:158117588 (150.7 MiB) TX bytes:5084830 (4.8 MiB)
eth2 Link encap:Ethernet HWaddr 00:25:90:91:6B:50
inet addr:10.111.1.37 Bcast:10.111.1.255 Mask:255.255.254.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26337628 errors:0 dropped:0 overruns:0 frame:0
TX packets:15500750 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12526923258 (11.6 GiB) TX bytes:2032767897 (1.8 GiB)
Memory:fb120000-fb140000
eth3 Link encap:Ethernet HWaddr 00:25:90:91:6B:51
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fb100000-fb120000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1895944 errors:0 dropped:0 overruns:0 frame:0
TX packets:1895944 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:154969511 (147.7 MiB) TX bytes:154969511 (147.7 MiB)
Answer
The peculiarities of using TCP/IP with Open MPI are described in the FAQ. I'll try to give an executive summary here.
Open MPI uses a greedy approach when it comes to utilising network interfaces for data exchange. In particular, the tcp components of the BTL (Byte Transfer Layer) and OOB (out-of-band) frameworks will try to use all configured network interfaces with matching address families. In your case each node has many interfaces with addresses from the IPv4 address family:
comp01-mpi comp02-mpi
----------------------------------------------------------
eth0 192.168.0.101/24 eth0 192.168.0.102/24
eth0.2002 10.111.2.36/24 eth0.2002 10.111.2.37/24
eth1 192.168.1.101/24 eth1 192.168.1.102/24
eth2 10.111.1.36/23 eth2 10.111.1.37/23
lo 127.0.0.1/8 lo 127.0.0.1/8
Open MPI assumes that each interface on comp02-mpi is reachable from any interface on comp01-mpi and vice versa. This is never the case with the loopback interface lo, therefore by default Open MPI excludes lo. Network sockets are then opened lazily (i.e. on demand) when information has to be transported.
What happens in your case is that when transporting messages, Open MPI chops them into fragments and then tries to send the different fragments over different connections in order to maximise the bandwidth. By default the fragments are 128 KiB in size, which holds only 32768 int elements, and the very first (eager) fragment is 64 KiB and holds half as many elements. It might happen that the assumption that each interface on comp01-mpi is reachable from each interface on comp02-mpi (and vice versa) is wrong, e.g. if some of them are connected to separate isolated networks. In that case the library will get stuck trying to establish a connection that can never succeed, and the program will hang. This should usually happen for messages of more than 16384 int elements.
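To see why 1024 chars went through while 360358 did not, here is a rough back-of-the-envelope illustration of the default thresholds mentioned above (a sketch only; the actual limits are governed by the btl_tcp_* MCA parameters of your Open MPI build):

```python
# Rough illustration of Open MPI's default TCP fragmentation thresholds,
# using the values quoted in the answer above.
EAGER_LIMIT = 64 * 1024   # first (eager) fragment: 64 KiB
FRAG_SIZE = 128 * 1024    # subsequent fragments: 128 KiB

def fragments(nbytes):
    """Approximate number of TCP fragments for a message of nbytes."""
    if nbytes <= EAGER_LIMIT:
        return 1
    rest = nbytes - EAGER_LIMIT
    return 1 + -(-rest // FRAG_SIZE)  # 1 eager fragment + ceil(rest / 128 KiB)

print(fragments(1024))    # -> 1: fits in the eager fragment, one connection
print(fragments(360358))  # -> 4: fragments may go over different interfaces
```

A 1024-char message fits entirely in the eager fragment, so only one (working) connection is ever needed; the 360358-char array is split across several fragments, which is when the unreachable interfaces come into play.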
To prevent the situation described above, one can restrict the interfaces or networks that Open MPI uses for TCP/IP communication. The btl_tcp_if_include MCA parameter can be used to provide the library with the list of interfaces that it should use. btl_tcp_if_exclude can be used to instruct the library which interfaces to exclude; it is set to lo by default, so if you want to exclude specific interfaces you should explicitly add lo back to the list.
Everything above also applies to the out-of-band communication used to transport special information. The corresponding parameters for selecting or deselecting interfaces for OOB are oob_tcp_if_include and oob_tcp_if_exclude. These are usually set together with the BTL parameters. Therefore you should try setting them to combinations that actually work. Start by narrowing the selection down to a single interface:
mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...
If it doesn't work with eth0, try the other interfaces.
The presence of the virtual interface eth0.2002 is going to further confuse Open MPI 1.6.2 and newer.
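If narrowing down to a single interface does not help, another option is to exclude the problematic interfaces instead. Note that once you override the default exclude list you must list lo explicitly; the trailing part of the command is your usual launch line:

```shell
mpiexec --mca btl_tcp_if_exclude lo,eth0.2002 \
        --mca oob_tcp_if_exclude lo,eth0.2002 ...
```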