Unable to run MPI when transferring large data


Problem description

I used MPI_Isend to transfer an array of chars to a slave node. When the size of the array is small it works, but when I enlarge the array, the program hangs.

Code running on the master node (rank 0):

MPI_Send(&text_length, 1, MPI_INT, dest, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
MPI_Isend(text->chars, 360358, MPI_CHAR, dest, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);

Code running on the slave node (rank 1):

MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD, &status);
MPI_Irecv(host_read_string, count, MPI_CHAR, 0, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);

You can see that the count parameter in MPI_Isend is 360358. It seems to be too large for MPI. When I set the parameter to 1024, it worked well.

Actually this problem has confused me for a few days. I know that there is a limit on the size of data transferred by MPI, but as far as I know MPI_Send is used to send short messages while MPI_Isend can send larger messages, so I used MPI_Isend.
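
For reference, here is a minimal, self-contained sketch of the same exchange (the tag values and buffer contents are assumptions, since the question only shows fragments). With a correctly configured network both the blocking and the non-blocking calls handle a count of 360358 without problems, so the message size itself is not the limit:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Tag values are assumptions; the question only shows the tag names. */
#define MSG_TEXT_LENGTH 1
#define MSG_SEND_STRING 2

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Master: send the length first, then the (large) char buffer. */
        int text_length = 360358;
        char *chars = malloc(text_length);
        MPI_Request request;

        memset(chars, 'x', text_length);
        MPI_Send(&text_length, 1, MPI_INT, 1, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
        MPI_Isend(chars, text_length, MPI_CHAR, 1, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        free(chars);
    } else if (rank == 1) {
        /* Slave: receive the length, allocate a buffer, then receive it. */
        int count;
        char *host_read_string;
        MPI_Request request;

        MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        host_read_string = malloc(count);
        MPI_Irecv(host_read_string, count, MPI_CHAR, 0, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        printf("rank 1 received %d chars\n", count);
        free(host_read_string);
    }

    MPI_Finalize();
    return 0;
}

Run it with two processes, e.g. mpiexec -n 2 ./a.out.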

The network configuration on rank 0 is:

  [12t2007@comp01-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A5  
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:393267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:396421 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:35556328 (33.9 MiB)  TX bytes:79580008 (75.8 MiB)

eth0.2002 Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A5  
          inet addr:10.111.2.36  Bcast:10.111.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:133577 errors:0 dropped:0 overruns:0 frame:0
          TX packets:127677 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:14182652 (13.5 MiB)  TX bytes:17504189 (16.6 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A4  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:206981 errors:0 dropped:0 overruns:0 frame:0
          TX packets:303185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:168952610 (161.1 MiB)  TX bytes:271792020 (259.2 MiB)

eth2      Link encap:Ethernet  HWaddr 00:25:90:91:6B:56  
          inet addr:10.111.1.36  Bcast:10.111.1.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26459977 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15700862 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12533940345 (11.6 GiB)  TX bytes:2078001873 (1.9 GiB)
          Memory:fb120000-fb140000 

eth3      Link encap:Ethernet  HWaddr 00:25:90:91:6B:57  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fb100000-fb120000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1894012 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1894012 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:154962344 (147.7 MiB)  TX bytes:154962344 (147.7 MiB)

The network configuration on rank 1 is:

[12t2007@comp02-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5F  
          inet addr:192.168.0.102  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:328449 errors:0 dropped:0 overruns:0 frame:0
          TX packets:278631 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:47679329 (45.4 MiB)  TX bytes:39326294 (37.5 MiB)

eth0.2002 Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5F  
          inet addr:10.111.2.37  Bcast:10.111.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:94126 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53782 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:8313498 (7.9 MiB)  TX bytes:6929260 (6.6 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5E  
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:121527 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41865 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:158117588 (150.7 MiB)  TX bytes:5084830 (4.8 MiB)

eth2      Link encap:Ethernet  HWaddr 00:25:90:91:6B:50  
          inet addr:10.111.1.37  Bcast:10.111.1.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26337628 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15500750 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12526923258 (11.6 GiB)  TX bytes:2032767897 (1.8 GiB)
          Memory:fb120000-fb140000 

eth3      Link encap:Ethernet  HWaddr 00:25:90:91:6B:51  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fb100000-fb120000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1895944 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1895944 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:154969511 (147.7 MiB)  TX bytes:154969511 (147.7 MiB)

Recommended answer

The peculiarities of using TCP/IP with Open MPI are described in the FAQ. I'll try to give an executive summary here.

Open MPI uses a greedy approach when it comes to utilising network interfaces for data exchange. In particular, the tcp components of the BTL (Byte Transfer Layer) and OOB (Out-Of-Band) frameworks will try to use all configured network interfaces with matching address families. In your case each node has many interfaces with addresses from the IPv4 address family:

comp01-mpi                     comp02-mpi
----------------------------------------------------------
eth0       192.168.0.101/24    eth0       192.168.0.102/24
eth0.2002  10.111.2.36/24      eth0.2002  10.111.2.37/24
eth1       192.168.1.101/24    eth1       192.168.1.102/24
eth2       10.111.1.36/23      eth2       10.111.1.37/23
lo         127.0.0.1/8         lo         127.0.0.1/8

Open MPI assumes that each interface on comp02-mpi is reachable from any interface on comp01-mpi and vice versa. This is never the case with the loopback interface lo, therefore by default Open MPI excludes lo. Network sockets are then opened lazily (i.e. on demand) when information has to be transported.
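
A quick sanity check of that reachability assumption is to verify from comp01-mpi that each of comp02-mpi's addresses actually answers (a diagnostic sketch only, using the addresses from the table above):

# run on comp01-mpi; an address that does not answer indicates a network
# that Open MPI cannot actually use to reach comp02-mpi
ping -c 1 192.168.0.102   # eth0 subnet
ping -c 1 10.111.2.37     # eth0.2002 subnet
ping -c 1 192.168.1.102   # eth1 subnet
ping -c 1 10.111.1.37     # eth2 subnet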

What happens in your case is that, when transporting messages, Open MPI chops them into fragments and then tries to send the different fragments over different connections in order to maximise the bandwidth. By default the fragments are 128 KiB in size, which holds only 32768 int elements; the very first (eager) fragment is 64 KiB and holds half as many elements. It might happen that the assumption that each interface on comp01-mpi is reachable from each interface on comp02-mpi (and vice versa) is wrong, e.g. if some of them are connected to separate isolated networks. In that case the library gets stuck trying to make a connection that can never be established and the program hangs. This would usually happen for messages of more than 16384 int elements.
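
As an aside, the fragment sizes mentioned above are exposed as MCA parameters of the tcp BTL, so you can confirm what your installation uses (syntax as in Open MPI 1.6; newer releases may require adding --level 9 to show them):

# btl_tcp_eager_limit   - size of the first (eager) fragment, 64 KiB by default
# btl_tcp_max_send_size - size of the subsequent fragments, 128 KiB by default
ompi_info --param btl tcp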

To prevent the situation described above, one can restrict the interfaces or networks that Open MPI uses for TCP/IP communication. The btl_tcp_if_include MCA parameter can be used to give the library the list of interfaces that it should use; btl_tcp_if_exclude can be used to tell the library which interfaces to exclude. The latter is set to lo by default, so if one wants to exclude specific interfaces, lo should be added explicitly to the list.

Everything above also applies to the out-of-band communication used to transport special information. The corresponding parameters for selecting or deselecting OOB interfaces are oob_tcp_if_include and oob_tcp_if_exclude. These are usually set together with the BTL parameters, so you should try setting them to combinations that actually work. Start by narrowing the selection down to a single interface:

 mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...

If it doesn't work with eth0, try other interfaces.

The presence of the virtual interface eth0.2002 is going to further confuse Open MPI 1.6.2 and newer.
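
Given the interface listing above, a reasonable variant to try (an illustrative sketch only; pick whichever interfaces are actually routed between the two nodes) is to exclude the loopback and the virtual VLAN interface explicitly, keeping lo in the list as explained earlier:

 mpiexec --mca btl_tcp_if_exclude lo,eth0.2002 --mca oob_tcp_if_exclude lo,eth0.2002 ...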
