MPI_Barrier doesn't function properly

Problem description

I wrote the C application below to help me understand MPI and why MPI_Barrier() isn't functioning in my huge C++ application. I was able to reproduce the problem from the huge application with this much smaller C program: essentially, I call MPI_Barrier() inside a for loop, and MPI_Barrier() is visible to all nodes, yet after two iterations of the loop the program becomes deadlocked. Any thoughts?

#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int i=0, numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("%s: Rank %d of %d\n", processor_name, rank, numprocs);
    for(i=1; i <= 100; i++) {
            if (rank==0) printf("Before barrier (%d:%s)\n",i,processor_name);
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank==0) printf("After barrier (%d:%s)\n",i,processor_name);
    }

    MPI_Finalize();
    return 0;
}

Output:

alienone: Rank 1 of 4
alienfive: Rank 3 of 4
alienfour: Rank 2 of 4
alientwo: Rank 0 of 4
Before barrier (1:alientwo)
After barrier (1:alientwo)
Before barrier (2:alientwo)
After barrier (2:alientwo)
Before barrier (3:alientwo)

I am using GCC 4.4 and Open MPI 1.3 from the Ubuntu 10.10 repositories.

Also, in my huge C++ application, MPI broadcasts don't work: only half the nodes receive the broadcast, and the others are stuck waiting for it.

Thank you in advance for any help or insights!

Update: Upgraded to Open MPI 1.4.4, compiled from source into /usr/local/.

Update: Attaching GDB to the running processes yields an interesting result. It seems all nodes have died at the MPI barrier, but MPI still thinks they are running:

0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at   ../sysdeps/unix/sysv/linux/poll.c:83
83  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
    in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0  0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1  0x00007fc236a45141 in poll_dispatch () from /usr/local/lib/libopen-pal.so.0
#2  0x00007fc236a43f89 in opal_event_base_loop () from /usr/local/lib/libopen-pal.so.0
#3  0x00007fc236a38119 in opal_progress () from /usr/local/lib/libopen-pal.so.0
#4  0x00007fc236eff525 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.0
#5  0x00007fc23141ad76 in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6  0x00007fc2314247ce in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7  0x00007fc236f15f12 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#8  0x0000000000400b32 in main (argc=1, argv=0x7fff5883da58) at barrier_test.c:14
(gdb) 

Update: I also have this code:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int n = 400, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("MPI Rank %i of %i.\n", myid, numprocs);

    /* Loop forever so MPI_Reduce() is exercised repeatedly. */
    while (1) {
        /* Midpoint-rule integration of 4/(1+x^2) over [0,1], which is pi;
           each rank sums every numprocs-th interval. */
        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }

    MPI_Finalize();  /* unreachable: the while (1) loop never exits */
    return 0;
}

Despite the infinite loop, the printf() inside the loop produces output only once:

mpirun -n 24 --machinefile /etc/machines a.out 
MPI Rank 0 of 24.
MPI Rank 3 of 24.
MPI Rank 1 of 24.
MPI Rank 4 of 24.
MPI Rank 17 of 24.
MPI Rank 15 of 24.
MPI Rank 5 of 24.
MPI Rank 7 of 24.
MPI Rank 16 of 24.
MPI Rank 2 of 24.
MPI Rank 11 of 24.
MPI Rank 9 of 24.
MPI Rank 8 of 24.
MPI Rank 20 of 24.
MPI Rank 23 of 24.
MPI Rank 19 of 24.
MPI Rank 12 of 24.
MPI Rank 13 of 24.
MPI Rank 21 of 24.
MPI Rank 6 of 24.
MPI Rank 10 of 24.
MPI Rank 18 of 24.
MPI Rank 22 of 24.
MPI Rank 14 of 24.
pi is approximately 3.1415931744231269, Error is 0.0000005208333338

Any ideas?

Answer

MPI_Barrier() in Open MPI sometimes hangs when processes reach the barrier at different times after the previous barrier, but as far as I can see that is not your case. Anyway, try using MPI_Reduce() instead of, or right before, the real call to MPI_Barrier(). A reduction is not a direct equivalent of a barrier, but any synchronous call with almost no payload that involves all processes in a communicator should work like one. I haven't seen such behavior of MPI_Barrier() in LAM/MPI, MPICH2, or even WMPI, but it was a real issue with Open MPI.
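To make the suggestion concrete, here is a minimal sketch of such a reduction-based stand-in. The helper name reduce_barrier is hypothetical, not part of any MPI API, and it uses MPI_Allreduce rather than plain MPI_Reduce so that no rank can return before every rank has contributed, which gives the strict barrier-like semantics described above:

#include <mpi.h>

/* Hypothetical barrier substitute: a tiny collective reduction that every
 * rank in the communicator must enter. MPI_Allreduce cannot complete on
 * any rank until all ranks have contributed, so it synchronizes like a
 * barrier while exercising a different code path in the MPI library. */
static void reduce_barrier(MPI_Comm comm)
{
    int token = 0, result = 0;
    MPI_Allreduce(&token, &result, 1, MPI_INT, MPI_SUM, comm);
}

Swapping reduce_barrier(MPI_COMM_WORLD) for MPI_Barrier(MPI_COMM_WORLD) in the test loop above would show whether the hang is specific to the tuned barrier algorithm (frame #6 of the backtrace shows Open MPI's recursive-doubling barrier) or affects collectives in general.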
