MPI_SEND stops working after MPI_BARRIER

Problem Description

I'm building a distributed web server in C/MPI and it seems like point-to-point communication completely stops working after the first MPI_BARRIER in my code. Standard C code works after the barrier, so I know that each of the threads makes it through the barrier. Point-to-point communication also works just fine before the barrier. However, when I copy-paste the same code that worked the line before the barrier into the line after the barrier it stops working entirely. The SEND will just wait forever. When I try using an ISEND instead, it makes it through the line, but the message is never received. I've been googling this problem a lot and everyone who has problems with MPI_BARRIER is told the barrier works correctly and their code is wrong, but I cannot for the life of me figure out why my code is wrong. What could be causing this behavior?

Here is a sample program that demonstrates this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int procID;
  int val;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &procID);
  MPI_Barrier(MPI_COMM_WORLD);

  if (procID == 0)
  {
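    /* Rank 0: send one integer with tag 4 to rank 1 (blocking send). */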
    val = 4;
    printf("Before send\n");
    MPI_Send(&val, 1, MPI_INT, 1, 4, MPI_COMM_WORLD);
    printf("after send\n");
  }

  if (procID == 1)
  {
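    /* Rank 1: receive the integer from any source; val should change from 1 to 4. */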
    val = 1;
    printf("before: val = %d\n", val);
    MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    printf("after: val = %d\n", val);
  }

  MPI_Finalize();
  return 0;
}
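
To reproduce, the program can be compiled and launched with the standard MPI wrappers; a minimal sketch, where barrier.c and the binary name are placeholders (the code uses ranks 0 and 1, so at least two processes are needed):

mpicc barrier.c -o barrier
mpirun -np 2 ./barrier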

Moving the two if statements before the barrier causes this program to run correctly.

EDIT - It appears that the first communication, regardless of type, works, and all future communications fail. This is much more general than I thought at first. It doesn't matter whether the first communication is a barrier or some other message; no future communications work properly.

Answer

Open MPI has a known feature when it uses TCP/IP for communications: it tries to use all configured network interfaces that are in the "UP" state. This becomes a problem if some of the other nodes are not reachable through all of those interfaces. It is part of the greedy communication optimisation that Open MPI employs, and sometimes, as in your case, it leads to problems.

It seems that at least the second node has more than one network interface that is up, and this fact was made known to the first node during the negotiation phase:

  • one configured with 128.2.100.167
  • one configured with 192.168.109.1 (are you running a tunnel or Xen on the machine?)
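
To see which interfaces Open MPI may try to use, you can list the interfaces that are currently up on each node. This is a generic Linux diagnostic (assuming the iproute2 tools are installed), not an Open MPI command:

ip addr show up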

The barrier communication happens over the first network, and then the next MPI_Send tries to send to the second address over the second network, which obviously does not connect all nodes.

The easiest solution is to tell Open MPI to use only the network that connects your nodes. You can tell it to do so using the following MCA parameter:

--mca btl_tcp_if_include 128.2.100.0/24

(or whatever your communication network is)
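
MCA parameters are passed on the mpirun command line. A minimal sketch, where ./server and the process count are placeholders for your own program:

mpirun --mca btl_tcp_if_include 128.2.100.0/24 -np 2 ./server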

You can also specify the list of network interfaces if it is the same on all machines, e.g.

--mca btl_tcp_if_include eth0

or you can tell Open MPI to specifically exclude certain interfaces (but you must always tell it to exclude the loopback "lo" if you do so):

--mca btl_tcp_if_exclude lo,virt0
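
MCA parameters can also be set as environment variables of the form OMPI_MCA_&lt;name&gt;, which is handy when you cannot easily edit the mpirun invocation. A sketch with the same placeholder program name:

export OMPI_MCA_btl_tcp_if_include=eth0
mpirun -np 2 ./server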

Hope that helps you and the many others who appear to have the same problem here on SO. It looks like recently almost all Linux distros have started bringing up various network interfaces by default, and that is likely to cause problems with Open MPI.

P.S. Put those nodes behind a firewall, please!
