CUDA, mutex and atomicCAS()


Problem description

Recently I started developing with CUDA and ran into a problem with atomicCAS(). To do some manipulation of memory in device code I have to create a mutex, so that only one thread can work with the memory in a critical section of the code.

The device code below runs on 1 block with several threads.

__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...

    do 
    {
        atomicCAS(mutex, 0, 1 + i);   // try to set *mutex from 0 to this thread's tag (i + 1)
    }
    while (*mutex != i + 1);          // spin until this thread holds the mutex

    //critical section
    //do some manipulations with objects in device memory

    *mutex = 0;                       // release the mutex

    ...
}

When the first thread executes

atomicCAS(mutex, 0, 1 + i);

mutex becomes 1. After that, the first thread changes its status from Active to Inactive, and the line

*mutex = 0;

is not executed. The other threads stay in the loop forever. I have tried many variants of this cycle, such as while(){}, do{}while(), a temporary variable = *mutex inside the loop, and even a variant with if(){} and goto, but the result is the same.

The host part of the code:

...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);

I use Visual Studio 2012 with CUDA 5.5.

The device is an NVIDIA GeForce GT 240 with compute capability 1.2.

Thanks in advance.

UPD: After spending some time this spring working on my diploma project, I found a solution for a critical section on CUDA. It is a combination of lock-free and mutex mechanisms. Here is the working code; I used it to implement an atomic dynamically resizable array.

// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex) 
{
    bool isSet = false; 
    do 
    {
        if ((isSet = (atomicCAS(mutex, 0, 1) == 0)))   // old value 0 means this thread took the lock
        {
            // critical section goes here
        }
        if (isSet) 
        {
            *mutex = 0;                                // release the lock
        }
    } 
    while (!isSet);
}
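
For reference, a minimal host-side sketch for launching this kernel (threadCount, the synchronization call, and the cleanup are my additions, not from the original post; the mutex type now matches the kernel's unsigned long long parameter):

unsigned long long* mutex;
cudaMalloc((void**)&mutex, sizeof(unsigned long long));
cudaMemset(mutex, 0, sizeof(unsigned long long));   // the kernel expects *mutex == 0 on entry
kernelFunction<<<1, threadCount>>>(..., mutex);
cudaDeviceSynchronize();
cudaFree(mutex);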


Recommended answer


The loop in question

do 
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);

would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
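
To illustrate that claim, here is a host-side sketch of my own (not part of the original question or answer; the counter, thread count, and names are arbitrary) that writes the same spin pattern with std::atomic and std::thread. It completes normally because the OS schedules each CPU thread independently:

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main()
{
    std::atomic<int> mutex{0};
    int counter = 0;                          // shared resource protected by the mutex

    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i)
    {
        threads.emplace_back([&, i]
        {
            int expected;
            do                                // same pattern as the kernel: CAS(0 -> i + 1),
            {                                 // then check whether this thread is the owner
                expected = 0;
                mutex.compare_exchange_strong(expected, i + 1);
            }
            while (mutex.load() != i + 1);

            ++counter;                        // critical section

            mutex.store(0);                   // release
        });
    }

    for (auto& t : threads)
        t.join();

    std::printf("counter = %d\n", counter);   // prints 5
    return 0;
}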

However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped in sets of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads wait (i.e. sleep) until the divergent threads finish. [1]

Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.

A potential solution is to make a local copy of the shared resource in question, let the 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource, as in the sketch below. A __shared__ variable is ideal in this situation: it is shared by the threads belonging to the same block but not by other blocks, and we can use __syncthreads() to fine-control when the member threads access this variable.
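
A minimal sketch of that pattern (the kernel name, the 32-slot array, and the sum-of-squares work are assumptions for illustration, not the original poster's code): every thread fills its slot of a block-local copy, __syncthreads() separates the phases, and thread 0 alone pushes the combined result to global memory, so no mutex is needed.

// Assumes a single block of at most 32 threads, as in the question.
__global__ void kernelSharedCopy(int* globalResult)
{
    __shared__ int localCopy[32];   // block-local copy of the shared resource (one slot per thread)

    int i = threadIdx.x;

    localCopy[i] = i * i;           // all member threads modify the copy in parallel

    __syncthreads();                // wait until every member thread has written its slot

    if (i == 0)                     // pick one thread to push the change back
    {
        int sum = 0;
        for (int k = 0; k < blockDim.x; ++k)
            sum += localCopy[k];
        *globalResult = sum;        // single writer: no mutex required
    }
}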

[1] CUDA Best Practices Guide - Branching and Divergence


Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
