CUDA, mutex and atomicCAS()


Question

Recently I started developing on CUDA and ran into a problem with atomicCAS(). To do some manipulation of memory in device code, I have to create a mutex, so that only one thread can work with the memory in a critical section of the code.

The device code below runs on 1 block and several threads.

__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...

    do 
    {
        atomicCAS(mutex, 0, 1 + i);   // try to acquire the lock by writing this thread's id + 1
    }
    while (*mutex != i + 1);          // spin until this thread holds the lock

    //critical section
    //do some manipulations with objects in device memory

    *mutex = 0;                       // release the lock

    ...
}

When the first thread executes

atomicCAS(mutex, 0, 1 + i);

mutex is 1. After that, the first thread changes its status from Active to Inactive, and the line

*mutex = 0;

is never executed. The other threads stay in the loop forever. I have tried many variants of this loop, such as while(){};, do{}while();, using a temp variable = *mutex inside the loop, and even a variant with if(){} and goto. But the result is the same.

The host part of the code:

...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));   // allocate the mutex in device memory
cudaMemset(mutex, 0, sizeof(int));          // initialize it to 0 (unlocked)
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);

I use Visual Studio 2012 with CUDA 5.5.

The device is NVidia GeForce GT 240 with compute capability 1.2.

Thanks in advance.

UPD: After some time working on my diploma project this spring, I found a solution for a critical section on CUDA. It is a combination of lock-free and mutex mechanisms. Here is the working code. I used it to implement an atomic dynamically resizable array.

// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex) 
{
    bool isSet = false; 
    do 
    {
        if ((isSet = (atomicCAS(mutex, 0, 1) == 0)))   // only the thread that wins the CAS enters
        {
            // critical section goes here
        }
        if (isSet) 
        {
            *mutex = 0;   // release the lock (note the dereference; "mutex = 0" would only clear the local pointer)
        }
    } 
    while (!isSet);
}
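
A small variant of the release step, given here only as a hedged sketch (kernelFunctionFenced and the omitted extra parameters are my own placeholders, not from the original post), uses __threadfence() and atomicExch() so that the writes made inside the critical section are guaranteed to be visible to other threads before the lock is observed as free:

// Sketch of the same lock with an explicitly fenced, atomic release.
__global__ void kernelFunctionFenced(unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        if ((isSet = (atomicCAS(mutex, 0, 1) == 0)))   // try to acquire the lock
        {
            // critical section goes here

            __threadfence();          // flush the critical-section writes before releasing
            atomicExch(mutex, 0ULL);  // release the lock with an atomic store
        }
    }
    while (!isSet);
}

The key design point of the posted code is unchanged in this sketch: the CAS and the critical section sit inside the same if, so the thread that wins the lock is never left waiting on the divergent threads that lost it.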

Answer

The loop in question

do 
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);

would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
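
As a side illustration (entirely my own, not part of the answer), the same "CAS my id in, spin until I own it" pattern does behave correctly with ordinary CPU threads, because each thread is scheduled independently and the lock holder always makes progress:

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<int> mutexFlag{0};   // hypothetical host-side analogue of *mutex
int sharedCounter = 0;           // resource protected by the lock

void worker(int i)
{
    int expected;
    do
    {
        expected = 0;
        mutexFlag.compare_exchange_strong(expected, 1 + i);   // analogue of atomicCAS(mutex, 0, 1 + i)
    }
    while (mutexFlag.load() != i + 1);                        // spin until this thread holds the lock

    ++sharedCounter;                                          // critical section
    mutexFlag.store(0);                                       // release; exactly one waiter acquires next
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
    std::printf("counter = %d\n", sharedCounter);             // prints 5
}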

However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) until the divergent threads finish. [1]

Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.

A potential solution is to make a local copy of the shared resource in question, let the 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it is shared by the threads belonging to the same block but not by threads of other blocks. We can use __syncthreads() to finely control access to this variable by the member threads.
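
A minimal sketch of that pattern follows. It is only an illustration under my own assumptions (the kernel name, the BLOCK_SIZE of 32 and the per-element increment are placeholders, not from the answer): every thread stages its element in shared memory, __syncthreads() separates the phases, and thread 0 alone writes the block's result back to global memory.

#define BLOCK_SIZE 32   // assumed: one block of one warp, as in the question

__global__ void updateSharedResource(int* globalData)   // assumed: globalData holds BLOCK_SIZE ints
{
    __shared__ int localCopy[BLOCK_SIZE];
    int i = threadIdx.x;

    localCopy[i] = globalData[i];   // each thread loads its element into the block-local copy
    __syncthreads();                // wait until the whole copy is populated

    localCopy[i] += 1;              // each thread modifies the copy (placeholder work)
    __syncthreads();                // wait until all modifications are done

    if (i == 0)                     // a single thread pushes the result back
    {
        for (int k = 0; k < BLOCK_SIZE; ++k)
            globalData[k] = localCopy[k];
    }
}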

[1] CUDA Best Practices Guide - Branching and Divergence

Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
