Cuda Mutex, why deadlock?


Problem description

I am trying to implement an atomic-based mutex.

I succeeded, but I have one question about warps / deadlock.

This code works well.

bool blocked = true;

while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        blocked = false;
    }
}

But this one doesn't...

while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        break;
    }
}

I think it's about the position of the loop exit. In the first one, the exit happens where the condition is; in the second one it happens at the end of the if, so the thread waits for other warps to finish the loop, but the other threads wait for the first thread as well... But I think I am wrong, so please explain it to me if you can :).

Thanks!

Recommended answer


  1. There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.


Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().
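
As a rough, untested sketch of that advice (my illustration, not code from the original answer): only thread 0 of each block negotiates for a global lock word (here called gLock, an assumed name), the rest of the block waits at __syncthreads(), and the whole block then executes the critical section before thread 0 releases the lock:

__device__ int gLock = 0;          // assumed global lock word: 0 = free, 1 = held

__device__ void doCriticJob(){ }   // placeholder for the protected work

__global__ void kernelBlockLock(){
  // Only thread 0 of each block contends for the mutex, so the lock
  // negotiation never happens between threads of the same warp.
  if (threadIdx.x == 0){
    while (atomicCAS(&gLock, 0, 1) != 0);   // spin until the lock is acquired
  }
  __syncthreads();                 // the whole block now runs under the lock

  doCriticJob();                   // block-wide critical section

  __syncthreads();                 // wait until every thread in the block is done
  if (threadIdx.x == 0){
    __threadfence();               // typically needed so the critical section's
                                   // writes are visible to other blocks
    atomicExch(&gLock, 0);         // release the lock
  }
}

int main(){
  kernelBlockLock<<<4,128>>>();
  cudaDeviceSynchronize();
}

With this arrangement the mutex is only ever contended between blocks, never between threads of the same warp, which sidesteps the divergence-ordering problem analyzed below.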


This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.

Here is the code I used to analyze this:

$ cat t830.cu
#include <stdio.h>


__device__ int mLock = 0;

__device__ void doCriticJob(){

}

__global__ void kernel1(){

  int index = 0;
  int mSize = 1;
  while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        break;
    }
  }
}

__global__ void kernel2(){

  int index = 0;
  int mSize = 1;
  bool blocked = true;

  while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;

        doCriticJob();

        atomicExch(&mLock, 0);
        blocked = false;
    }
  }
}
int main(){

 kernel2<<<4,128>>>();
 cudaDeviceSynchronize();
}


kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on linux under CUDA 7 and run on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as is shown) it doesn't.
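
The exact build command isn't shown in the original answer; assuming the listing is saved as t830.cu, a build line consistent with the CUDA 7 / cc2.0 setup above would look something like:

$ nvcc -arch=sm_20 -o t830 t830.cu
$ ./t830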


I use cuobjdump -sass to dump the machine code:

$ cuobjdump -sass ./t830

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_20

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_20
                Function : _Z7kernel1v
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
        /*0008*/         MOV32I R4, 0x1;                   /* 0x1800000004011de2 */
        /*0010*/         SSY 0x48;                         /* 0x60000000c0000007 */
        /*0018*/         MOV R2, c[0xe][0x0];              /* 0x2800780000009de4 */
        /*0020*/         MOV R3, c[0xe][0x4];              /* 0x280078001000dde4 */
        /*0028*/         ATOM.E.CAS R0, [R2], RZ, R4;      /* 0x54080000002fdd25 */
        /*0030*/         ISETP.NE.AND P0, PT, R0, RZ, PT;  /* 0x1a8e0000fc01dc23 */
        /*0038*/     @P0 BRA 0x18;                         /* 0x4003ffff600001e7 */
        /*0040*/         NOP.S;                            /* 0x4000000000001df4 */
        /*0048*/         ATOM.E.EXCH RZ, [R2], RZ;         /* 0x547ff800002fdd05 */
        /*0050*/         EXIT;                             /* 0x8000000000001de7 */
                ............................


                Function : _Z7kernel2v
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
        /*0008*/         MOV32I R0, 0x1;                   /* 0x1800000004001de2 */
        /*0010*/         MOV32I R3, 0x1;                   /* 0x180000000400dde2 */
        /*0018*/         MOV R4, c[0xe][0x0];              /* 0x2800780000011de4 */
        /*0020*/         MOV R5, c[0xe][0x4];              /* 0x2800780010015de4 */
        /*0028*/         ATOM.E.CAS R2, [R4], RZ, R3;      /* 0x54061000004fdd25 */
        /*0030*/         ISETP.NE.AND P1, PT, R2, RZ, PT;  /* 0x1a8e0000fc23dc23 */
        /*0038*/    @!P1 MOV R0, RZ;                       /* 0x28000000fc0025e4 */
        /*0040*/    @!P1 ATOM.E.EXCH RZ, [R4], RZ;         /* 0x547ff800004fe505 */
        /*0048*/         LOP.AND R2, R0, 0xff;             /* 0x6800c003fc009c03 */
        /*0050*/         I2I.S32.S16 R2, R2;               /* 0x1c00000008a09e84 */
        /*0058*/         ISETP.NE.AND P0, PT, R2, RZ, PT;  /* 0x1a8e0000fc21dc23 */
        /*0060*/     @P0 BRA 0x18;                         /* 0x4003fffec00001e7 */
        /*0068*/         EXIT;                             /* 0x8000000000001de7 */
                ............................



Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$


Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).

The key difference between these realizations, then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.


Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (@P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.


If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.

(@P0 is a form of predication; you can read about it in the PTX reference to understand how it can lead to conditional branching.)

NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like the one described earlier (a single thread per threadblock negotiating for the lock) to be more robust, since it avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.
