mpiexec checkpointing error (RPi)

Problem description

Whenever I try to run an application (even a simple hello_world.c fails), I get this error:

mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name

[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] checkpoint completed
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

I just want to take a checkpoint and nothing else (and restart from it later).

Thanks in advance.

I have also tried with MPICH2, with no luck. Or maybe I'm doing something wrong somewhere...

pi@raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/  -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec@raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion

Code of test3:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {

    int rank;
    int size;
    int i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 counts from 0 to 100, busy-waiting between prints. */
        for (i = 0; i <= 100; i++) {
            int j = 0;
            while (j < 100000000) {
                j++;
            }
            printf("Count to: %i\n", i);
        }
    }
    /* All other ranks have nothing to do. */

    MPI_Finalize();
    return 0;

}

I just need one successful checkpoint and a demonstration of the restart. If someone has a working example (it doesn't matter what it does; a simple working "Hello World" would make me happy!), I would be very glad.

Happy New Year!

Recommended answer

The problem here was that the checkpoint interval was too small. Setting it to 20 s or more solved this problem (but not the other one :( ).
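
As a rough sketch of that fix, the command from the question can be rebuilt with only the interval changed. The "Previous checkpoint has not completed" message in the log is at least consistent with checkpoint requests arriving too quickly for the Pi, though that reading is an assumption. The variable names below are made up for illustration; the mpiexec flags are exactly those used in the question, and 20 is simply the smallest value the answer reports as working.

```shell
#!/bin/sh
# Hypothetical variable names; only the flag values come from the question.
CKPT_INTERVAL=20      # seconds; the answer reports 20 s or more avoided the error
CKPT_PREFIX=/tmp/     # directory where the checkpoint files are written
APP=./test3           # the counter program from the question

# Same command line as in the question, with only the interval changed.
CMD="mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix $CKPT_PREFIX -ckpoint-interval $CKPT_INTERVAL $APP"

# Print the command for inspection instead of running it.
echo "$CMD"
```

Once the printed line looks right, it can be pasted into the shell to run. If the error still appears, a larger interval may be worth trying, since the time a checkpoint takes probably depends on the storage speed and the size of the process.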
