Cuda Mutex, why deadlock?

Problem description

I am trying to implement an atomic-based mutex.

I succeeded, but I have one question about warps / deadlock.

This code works well.

bool blocked = true;

while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        blocked = false;
    }
}

But this one doesn't...

while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        break;
    }
}

I think it's about where the loop is exited. In the first one, the exit happens at the loop condition; in the second one it happens at the end of the if, so the thread waits for the other threads in the warp to finish the loop, but those threads are waiting on the first thread as well... But I think I am wrong, so please explain :).

Thanks!

Solution

  1. There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.

  2. Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp, is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().

  3. This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.

Here's my code for analysis:

$ cat t830.cu
#include <stdio.h>


__device__ int mLock = 0;

__device__ void doCriticJob(){

}

__global__ void kernel1(){

  int index = 0;
  int mSize = 1;
  while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        break;
    }
  }
}

__global__ void kernel2(){

  int index = 0;
  int mSize = 1;
  bool blocked = true;

  while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        blocked = false;
    }
  }
}
int main(){

 kernel2<<<4,128>>>();
 cudaDeviceSynchronize();
}

kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on linux under CUDA 7 and run on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as is shown) it doesn't.
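To reproduce the deadlocking case from the same file, it is enough to swap which kernel main() launches; a minimal sketch of that toggle (same file, nothing else changed):

int main(){

 // kernel1<<<4,128>>>();   // per the observation above, this launch deadlocks on that setup
 kernel2<<<4,128>>>();      // this launch completes
 cudaDeviceSynchronize();
}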

I use cuobjdump -sass to dump the machine code:

$ cuobjdump -sass ./t830

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_20

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_20
                Function : _Z7kernel1v
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
        /*0008*/         MOV32I R4, 0x1;                   /* 0x1800000004011de2 */
        /*0010*/         SSY 0x48;                         /* 0x60000000c0000007 */
        /*0018*/         MOV R2, c[0xe][0x0];              /* 0x2800780000009de4 */
        /*0020*/         MOV R3, c[0xe][0x4];              /* 0x280078001000dde4 */
        /*0028*/         ATOM.E.CAS R0, [R2], RZ, R4;      /* 0x54080000002fdd25 */
        /*0030*/         ISETP.NE.AND P0, PT, R0, RZ, PT;  /* 0x1a8e0000fc01dc23 */
        /*0038*/     @P0 BRA 0x18;                         /* 0x4003ffff600001e7 */
        /*0040*/         NOP.S;                            /* 0x4000000000001df4 */
        /*0048*/         ATOM.E.EXCH RZ, [R2], RZ;         /* 0x547ff800002fdd05 */
        /*0050*/         EXIT;                             /* 0x8000000000001de7 */
                ............................


                Function : _Z7kernel2v
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
        /*0008*/         MOV32I R0, 0x1;                   /* 0x1800000004001de2 */
        /*0010*/         MOV32I R3, 0x1;                   /* 0x180000000400dde2 */
        /*0018*/         MOV R4, c[0xe][0x0];              /* 0x2800780000011de4 */
        /*0020*/         MOV R5, c[0xe][0x4];              /* 0x2800780010015de4 */
        /*0028*/         ATOM.E.CAS R2, [R4], RZ, R3;      /* 0x54061000004fdd25 */
        /*0030*/         ISETP.NE.AND P1, PT, R2, RZ, PT;  /* 0x1a8e0000fc23dc23 */
        /*0038*/    @!P1 MOV R0, RZ;                       /* 0x28000000fc0025e4 */
        /*0040*/    @!P1 ATOM.E.EXCH RZ, [R4], RZ;         /* 0x547ff800004fe505 */
        /*0048*/         LOP.AND R2, R0, 0xff;             /* 0x6800c003fc009c03 */
        /*0050*/         I2I.S32.S16 R2, R2;               /* 0x1c00000008a09e84 */
        /*0058*/         ISETP.NE.AND P0, PT, R2, RZ, PT;  /* 0x1a8e0000fc21dc23 */
        /*0060*/     @P0 BRA 0x18;                         /* 0x4003fffec00001e7 */
        /*0068*/         EXIT;                             /* 0x8000000000001de7 */
                ............................



Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).

The key difference between these realizations, then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.

Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (@P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.

If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.

(@P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)

NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like this to be more robust, which avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.
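To make the recommendation in point 2 above concrete, here is a minimal sketch of such a block-level scheme. The names gLock, gCounter and blockLevelMutex are mine (hypothetical, not from the question), and this is an illustrative sketch under the caveats just mentioned, not a drop-in replacement: a single thread per block spins on the global lock, so threads in the same warp never compete for it, and __syncthreads() holds the rest of the block until the critical section is done.

__device__ int gLock = 0;      // hypothetical global lock: 0 = free, 1 = held
__device__ int gCounter = 0;   // hypothetical resource protected by the lock

__global__ void blockLevelMutex(){

  __shared__ int myIndex;                    // result published to the whole block

  if (threadIdx.x == 0){                     // only one thread per block negotiates
    while (atomicCAS(&gLock, 0, 1) != 0);    // spin until this block owns the lock
    myIndex = gCounter++;                    // critical section: one block at a time
    __threadfence();                         // make the update visible before releasing
    atomicExch(&gLock, 0);                   // release the lock
  }
  __syncthreads();                           // the rest of the block waits here

  // ... all threads in the block may now use myIndex ...
}

int main(){

 blockLevelMutex<<<4,128>>>();
 cudaDeviceSynchronize();
}

Even this pattern is only a sketch: as the note above says, inter-threadblock deadlocks are still possible, so the same programming and usage limitations apply.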
