MPICH example cpi generates an error when it runs on multiple freshly installed VPSes

Problem description


I have just begun learning about MPI, so I bought 3 VPSes to create an experiment environment. I successfully installed and configured ssh and MPICH. The three nodes can ssh to each other (but not to themselves) without a password, and the cpi example passes without any problem on the local machine. When I try to run it on all 3 nodes, the cpi program always exits with the error Fatal error in PMPI_Reduce: Unknown error class, error stack:. Here is the full description of what I did and what the error said.
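
For context, this is roughly how such a setup is usually prepared; the commands below are an illustrative sketch, not taken from the original post (the key type, target user and process count are assumptions):

[root@fire ~]# ssh-keygen -t rsa              # generate a key pair on each node
[root@fire ~]# ssh-copy-id root@mpi1          # push the public key to the other nodes
[root@fire ~]# ssh-copy-id root@mpi2
[root@fire examples]# mpiexec -n 4 ./cpi      # local-only run, which passes without problems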

[root@fire examples]# mpiexec -f ~/mpi/machinefile  -n 6 ./cpi
Process 3 of 6 is on mpi0
Process 0 of 6 is on mpi0
Process 1 of 6 is on mpi1
Process 2 of 6 is on mpi2
Process 4 of 6 is on mpi1
Process 5 of 6 is on mpi2
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1263)...............: MPI_Reduce(sbuf=0x7fff1c18c440, rbuf=0x7fff1c18c448, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(826)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 2
MPIR_Reduce_intra(846)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(250).......: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1563 RUNNING AT mpi0
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@mpi2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@mpi2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@mpi2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@mpi1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@mpi1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@mpi1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@mpi0] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@mpi0] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@mpi0] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@mpi0] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion


I have no clue what happened; any insights? As the comments suggested, here is the MPI cpi code.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f(double);

double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc,char *argv[])
{
    int    n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    fprintf(stdout,"Process %d of %d is on %s\n",
    myid, numprocs, processor_name);
    fflush(stdout);

    n = 10000;          /* default # of rectangles */
    if (myid == 0)
        startwtime = MPI_Wtime();

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h   = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", endwtime-startwtime);         
        fflush(stdout);
    }

    MPI_Finalize();
    return 0;
}
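
For completeness, the program above can be compiled with MPICH's compiler wrapper and launched the same way as in the question; the source file name cpi.c is an assumption here:

[root@fire examples]# mpicc -o cpi cpi.c
[root@fire examples]# mpiexec -f ~/mpi/machinefile -n 6 ./cpi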

Recommended answer


It is probably too late, but I will provide my answer anyway. I encountered the same problem, and after some research I figured out the issue.


If your machinefile contains hostnames instead of IP addresses and the machines are connected on a local network, then you should also have a nameserver running locally; otherwise, change the entries in your machinefile from hostnames to IP addresses. Having the hostnames only in /etc/hosts will not solve the issue.
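
To make this concrete, here is a sketch of what the change looks like (the hostnames are the ones from the output above; the IP addresses are placeholders, replace them with your nodes' real addresses):

[root@fire examples]# cat ~/mpi/machinefile
mpi0
mpi1
mpi2

After the change, the same file would contain only IP addresses:

[root@fire examples]# cat ~/mpi/machinefile
192.0.2.10
192.0.2.11
192.0.2.12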

This turned out to be my problem as well, and once I changed the entries in the machinefile to IP addresses it worked.
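
One way to check whether the hostnames are actually resolvable through a nameserver (nslookup queries DNS directly and does not consult /etc/hosts) is to run, on each node:

[root@fire examples]# nslookup mpi1
[root@fire examples]# nslookup mpi2

If these queries fail, switching the machinefile to IP addresses as described above is the simpler fix.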

Regards, GOPI
