MPI_Isend and MPI_Irecv seem to be causing a deadlock


Problem description

I'm using non-blocking communication in MPI to send various messages between processes. However, I appear to be getting a deadlock. I have used PADB (see here) to look at the message queues and have got the following output:

1:msg12: Operation 1 (pending_receive) status 0 (pending)
1:msg12: Rank local 4 global 4
1:msg12: Size desired 4
1:msg12: tag_wild 0
1:msg12: Tag desired 16
1:msg12: system_buffer 0
1:msg12: Buffer 0xcaad32c
1:msg12: 'Receive: 0xcac3c80'
1:msg12: 'Data: 4 * MPI_FLOAT'
--
1:msg32: Operation 0 (pending_send) status 2 (complete)
1:msg32: Rank local 4 global 4
1:msg32: Actual local 4 global 4
1:msg32: Size desired 4 actual 4
1:msg32: tag_wild 0
1:msg32: Tag desired 16 actual 16
1:msg32: system_buffer 0
1:msg32: Buffer 0xcaad32c
1:msg32: 'Send: 0xcab7c00'
1:msg32: 'Data transfer completed'
--
2:msg5: Operation 1 (pending_receive) status 0 (pending)
2:msg5: Rank local 1 global 1
2:msg5: Size desired 4
2:msg5: tag_wild 0
2:msg5: Tag desired 16
2:msg5: system_buffer 0
2:msg5: Buffer 0xabbc348
2:msg5: 'Receive: 0xabd1780'
2:msg5: 'Data: 4 * MPI_FLOAT'
--
2:msg25: Operation 0 (pending_send) status 2 (complete)
2:msg25: Rank local 1 global 1
2:msg25: Actual local 1 global 1
2:msg25: Size desired 4 actual 4
2:msg25: tag_wild 0
2:msg25: Tag desired 16 actual 16
2:msg25: system_buffer 0
2:msg25: Buffer 0xabbc348
2:msg25: 'Send: 0xabc5700'
2:msg25: 'Data transfer completed'

This seems to show that the sends have completed, but all of the receives are pending (the above is just a small part of the log for a tag value of 16). However, how can this happen? Surely a send can't complete without the associated receive completing, as in MPI all sends and receives have to match. At least that's what I thought...

Can anyone provide any insights?

I can provide the code I'm using to do this, but surely Isend and Irecv should work regardless of the order in which they are called, assuming that MPI_Waitall is called right at the end.
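For reference, this is roughly the pattern I mean (a minimal, self-contained sketch with made-up buffers and a hard-coded tag of 16, not my actual code from the gist):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sends 4 floats to the next rank and receives 4 from the
       previous one.  The Isend/Irecv can be posted in either order. */
    float send_buf[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float recv_buf[4];
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, 4, MPI_FLOAT, prev, 16, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, 4, MPI_FLOAT, next, 16, MPI_COMM_WORLD, &reqs[1]);

    /* Neither buffer is touched again until both requests complete. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %f %f %f %f\n", rank,
           recv_buf[0], recv_buf[1], recv_buf[2], recv_buf[3]);

    MPI_Finalize();
    return 0;
}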

Update: Code is available at this gist

Update: I've made various modifications to the code, but it still isn't working quite properly. The new code is at the same gist, and the output I'm getting is at this gist. I have a number of questions/issues with this code:

  1. Why is the output from the final loop (printing all of the arrays) interspersed with the rest of the output, when I have an MPI_Barrier() before it to make sure all of the work has been done before printing?

  2. Is it possible/sensible to send from rank 0 to rank 0 - will that work OK? (Assuming a correct matching receive is posted, of course; see the small sketch after this list.)

  3. I'm getting lots of very strange long numbers in the output, which I assume is some kind of memory-overwriting or variable-size problem. The interesting thing is that this must come from the MPI communication, because I initialise new_array to a value of 9999.99 and the communication obviously causes it to be changed to these strange values. Any ideas why?
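On question 2, this is the kind of self-send I have in mind, as a tiny stand-alone program (the tag of 7 and the value sent are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one float to itself: both requests are posted
           before the wait, so the self-send matches the self-receive. */
        float out = 42.0f, in = 0.0f;
        MPI_Request reqs[2];
        MPI_Irecv(&in,  1, MPI_FLOAT, 0, 7, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&out, 1, MPI_FLOAT, 0, 7, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank 0 received %f from itself\n", in);
    }

    MPI_Finalize();
    return 0;
}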

Overall it seems that some of the transposition is occurring (bits of the matrix seem to be transposed...), but definitely not all of it - it's these strange numbers that are coming up that are worrying me the most!

Solution

When using MPI_Isend and MPI_Irecv you have to be sure not to modify the buffers before the requests complete, and you are definitely violating this. What if you had the receives all go into a second matrix instead of doing it in place?
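Something along these lines is what I mean (a rough sketch with invented names and sizes, not your actual code from the gist):

#include <mpi.h>

#define N 4   /* illustrative size only */

/* Send out of one array and receive into a different one, so no buffer
   with an outstanding request on it ever gets overwritten. */
void exchange_row(float send_matrix[N][N], float recv_matrix[N][N],
                  int row, int dest, int src, int tag)
{
    MPI_Request reqs[2];

    MPI_Irecv(&recv_matrix[row][0], N, MPI_FLOAT, src, tag,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_matrix[row][0], N, MPI_FLOAT, dest, tag,
              MPI_COMM_WORLD, &reqs[1]);

    /* Between posting and waiting, neither array may be modified. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* Only now is it safe to reuse send_matrix or read recv_matrix. */
}

The point is simply that send_matrix and recv_matrix are distinct, so nothing an outstanding request points at is written before the wait completes.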

Also, global_x2 * global_y2 is your tag, but I'm not sure that it will be unique for every send-receive pair, which could be messing things up. What happens if you switch it to a sending tag of (global_y2 * global_columns) + global_x2 and a receiving tag of (global_x2 * global_columns) + global_y2?
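In other words, something like this (the parameter names mirror the ones in your code; what they mean there is my assumption):

/* Tags that encode the (row, column) pair so each send/receive pair gets
   a distinct tag.  Names follow the question; their exact meaning in the
   original code is assumed. */
static int make_send_tag(int global_x2, int global_y2, int global_columns)
{
    return (global_y2 * global_columns) + global_x2;
}

static int make_recv_tag(int global_x2, int global_y2, int global_columns)
{
    return (global_x2 * global_columns) + global_y2;
}

The Isend would then use make_send_tag(...) and the matching Irecv make_recv_tag(...), each computed from that rank's own indices.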

Edit: As for your question about output, I'm assuming you are testing this by running all your processes on the same machine and just looking at the standard output. When you do it this way, your output gets buffered oddly by the terminal, even though the printf code all executes before the barrier. There are two ways I get around this: you could either print to a separate file for each process, or you could send your output as messages to process 0 and let it do all the actual printing.
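The second option might look roughly like this (a sketch only; LINE_LEN, PRINT_TAG and the function name are arbitrary choices of mine):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define LINE_LEN  256   /* arbitrary maximum line length        */
#define PRINT_TAG 999   /* arbitrary tag for output messages    */

/* Every rank passes its formatted line here; rank 0 prints them in rank
   order so the terminal cannot interleave them. */
void print_in_order(const char *line, int rank, int size)
{
    if (rank == 0) {
        char buf[LINE_LEN];
        printf("%s", line);                      /* rank 0's own line first */
        for (int src = 1; src < size; src++) {
            MPI_Recv(buf, LINE_LEN, MPI_CHAR, src, PRINT_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s", buf);
        }
    } else {
        MPI_Send((void *)line, (int)strlen(line) + 1, MPI_CHAR, 0,
                 PRINT_TAG, MPI_COMM_WORLD);
    }
}

Each rank would snprintf its values into a buffer and call print_in_order once; everything then appears on rank 0's stdout in rank order.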
