MPI message received in different communicator - erroneous program or MPI implementation bug?

Problem description

This is a follow-up to this previous question of mine, for which the conclusion was that the program was erroneous, and therefore the expected behavior was undefined.

What I'm trying to create here is a simple error-handling mechanism, for which I use that Irecv request for the empty message as an "abort handle", attaching it to my normal MPI_Wait call (and turning it into MPI_WaitAny), in order to allow me to unblock process 1 in case an error occurs on process 0 and it can no longer reach the point where it's supposed to post the matching MPI_Recv.

What's happening is that, due to internal message buffering, the MPI_Isend may succeed right away, without the other process being able to post the matching MPI_Recv. So there's no way of canceling it anymore.

I was hoping that once all processes call MPI_Comm_free I can just forget about that message once and for all, but, as it turns out, that's not the case. Instead, it's being delivered to the MPI_Recv in the following communicator.

So my questions are:

  1. Is this also an erroneous program, or is it a bug in the MPI implementation (Intel MPI 4.0.3)?
  2. If I turn my MPI_Isend calls into MPI_Issend, the program works as expected - can I at least in that case rest assured that the program is correct?
  3. Am I reinventing the wheel here? Is there a simpler way to achieve this?

Again, any feedback is much appreciated!

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Group group;
    MPI_Comm my_comm;

    srand(time(NULL));
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &group);

    MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
    if (rank == 0) printf("created communicator %d\n", my_comm);

    if (rank == 1) {
        MPI_Request req[2];
        int msg = 123, which;

        MPI_Isend(&msg, 1, MPI_INT, 0, 0, my_comm, &req[0]);
        MPI_Irecv(NULL, 0, MPI_INT, 0, 0, my_comm, &req[1]);

        MPI_Waitany(2, req, &which, MPI_STATUS_IGNORE);

        MPI_Barrier(my_comm);

        if (which == 0) {
            printf("rank 1: send succeed; cancelling abort handle\n");
            MPI_Cancel(&req[1]);
            MPI_Wait(&req[1], MPI_STATUS_IGNORE);
        } else {
            printf("rank 1: send aborted; cancelling send request\n");
            MPI_Cancel(&req[0]);
            MPI_Wait(&req[0], MPI_STATUS_IGNORE);
        }
    } else {
        MPI_Request req;
        int msg, r = rand() % 2;
        if (r) {
            printf("rank 0: receiving message\n");
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
        } else {
            printf("rank 0: sending abort message\n");
            MPI_Isend(NULL, 0, MPI_INT, 1, 0, my_comm, &req);
        }

        MPI_Barrier(my_comm);

        if (!r) {
            MPI_Cancel(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

    if (rank == 0) printf("freeing communicator %d\n", my_comm);
    MPI_Comm_free(&my_comm);

    sleep(2);

    MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
    if (rank == 0) printf("created communicator %d\n", my_comm);

    if (rank == 0) {
        MPI_Request req;
        MPI_Status status;
        int msg, cancelled;

        MPI_Irecv(&msg, 1, MPI_INT, 1, 0, my_comm, &req);
        sleep(1);

        MPI_Cancel(&req);
        MPI_Wait(&req, &status);
        MPI_Test_cancelled(&status, &cancelled);

        if (cancelled) {
            printf("rank 0: receive cancelled\n");
        } else {
            printf("rank 0: STRAY MESSAGE RECEIVED!!!\n");
        }
    }

    if (rank == 0) printf("freeing communicator %d\n", my_comm);
    MPI_Comm_free(&my_comm);

    MPI_Finalize();
    return 0;
}

Output:

created communicator -2080374784
rank 0: sending abort message
rank 1: send succeed; cancelling abort handle
freeing communicator -2080374784
created communicator -2080374784
rank 0: STRAY MESSAGE RECEIVED!!!
freeing communicator -2080374784


Answer

As mentioned in one of the comments above by @kraffenetti, this is an erroneous program, because the sent messages are never matched by receives. Even though the messages are cancelled, they still need a matching receive on the remote side: a cancel may fail for a sent message, because the message may already have left before the cancellation could complete (which is the case here).

This question started a thread on a ticket for MPICH, which you can find here; it has more details.
