CUDA atomics change flag

Question

I have a piece of serial code that does something like this:

if (!variable)
{
  // do some initialization here
  variable = true;
}

I understand that this works perfectly fine in serial code, where the initialization is executed only once. Which atomic operation would be the correct one to use here in CUDA?

Recommended answer

It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.

A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.

Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.
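For the simplest form of the question (run an initialization body exactly once), a single atomic on a flag may already be enough. The following is a minimal sketch, not part of the original answer; the names `init_flag`, `init_count`, `init_once_kernel` and `run_init_once_demo` are hypothetical. It assumes that the threads which lose the race do not need to wait for the initialization to finish.

```cuda
#include <cuda_runtime.h>

// Hypothetical names, introduced only for this illustration.
__device__ int init_flag = 0;   // 0 = initialization not yet claimed, 1 = claimed
__device__ int init_count = 0;  // counts how many threads ran the init body

__global__ void init_once_kernel()
{
    // atomicCAS returns the value that was stored before the operation.
    // Exactly one thread observes 0 and swaps in 1; all others observe 1.
    if (atomicCAS(&init_flag, 0, 1) == 0) {
        // "do some initialization here"
        atomicAdd(&init_count, 1);
    }
    // Caveat: threads that lose the race continue immediately. They are NOT
    // guaranteed that the initialization has completed at this point.
}

// Host helper: launches the kernel and returns how often the init body ran.
int run_init_once_demo()
{
    init_once_kernel<<<64, 128>>>();
    cudaDeviceSynchronize();
    int count = -1;
    cudaMemcpyFromSymbol(&count, init_count, sizeof(int));
    return count;
}
```

If other threads must observe the results of the initialization before they continue, the flag alone is not sufficient, which is where a critical section comes in.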

You should use the following code in your kernel to control thread access to a critical section:

__syncthreads();
if (threadIdx.x == 0)
  acquire_semaphore(&sem);
__syncthreads();
  //begin critical section
  // ... your critical section code goes here
  //end critical section
__syncthreads();
if (threadIdx.x == 0)
  release_semaphore(&sem);
__syncthreads();

Before the kernel, define these helper functions and this device variable:

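// Global lock shared by all threadblocks: 0 = free, 1 = held
__device__ volatile int sem = 0;

// Spin until this thread atomically changes the lock from 0 to 1
__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
}

// Store 0 to free the lock, then issue a device-wide memory fence
__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
  __threadfence();
}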
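Putting the pieces together, the pattern can be exercised with a small test program. The sketch below is an illustration rather than the answer's own code: `demo_sem`, `demo_acquire`, `demo_release`, `critical_section_kernel` and `run_critical_section_demo` are hypothetical names. As a conservative variation of my own, it issues `__threadfence()` before releasing the lock and releases it with `atomicExch`, so that writes made inside the critical section are ordered ahead of the unlock. Each threadblock performs a deliberately non-atomic read-modify-write on a counter, which is only safe because the lock serializes the blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical names, introduced only for this illustration.
__device__ volatile int demo_sem = 0;  // 0 = free, 1 = held

__device__ void demo_acquire(volatile int *lock)
{
    while (atomicCAS((int *)lock, 0, 1) != 0) { }  // spin until we own the lock
    __threadfence();  // order later reads after the successful acquire
}

__device__ void demo_release(volatile int *lock)
{
    __threadfence();            // make critical-section writes visible first
    atomicExch((int *)lock, 0); // then free the lock
}

__global__ void critical_section_kernel(volatile int *counter)
{
    __syncthreads();
    if (threadIdx.x == 0)
        demo_acquire(&demo_sem);
    __syncthreads();
    // begin critical section: only thread 0 of the winning block does the work
    if (threadIdx.x == 0) {
        int old = *counter;   // non-atomic read ...
        *counter = old + 1;   // ... and write, protected by the lock
    }
    // end critical section
    __syncthreads();
    if (threadIdx.x == 0)
        demo_release(&demo_sem);
    __syncthreads();
}

// Host helper: runs `blocks` threadblocks and returns the final counter value.
int run_critical_section_demo(int blocks)
{
    int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    critical_section_kernel<<<blocks, 128>>>(d_counter);
    int result = -1;
    cudaMemcpy(&result, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return result;
}
```

If the lock works as intended, the returned value should equal the number of blocks launched, since every block increments the counter exactly once.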

I have tested and successfully used the above code. Note that it essentially arbitrates between threadblocks, using thread 0 in each threadblock as the requestor. If you want only one thread in the winning threadblock to execute the critical section code, you should further condition that code (e.g. if (threadIdx.x < ...)).

Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)
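As one way to picture "arbitrate per threadblock, then communicate inside the block", the sketch below has thread 0 take a ticket number inside the critical section and hand it to the rest of its block through shared memory. This is an assumption-laden illustration, not the answer's code: `ticket_sem`, `ticket_acquire`, `ticket_release`, `ticket_kernel` and `run_ticket_demo` are hypothetical names, and the fence placement is my own conservative choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical names, introduced only for this illustration.
__device__ volatile int ticket_sem = 0;  // 0 = free, 1 = held

__device__ void ticket_acquire(volatile int *lock)
{
    while (atomicCAS((int *)lock, 0, 1) != 0) { }
    __threadfence();
}

__device__ void ticket_release(volatile int *lock)
{
    __threadfence();
    atomicExch((int *)lock, 0);
}

__global__ void ticket_kernel(volatile int *next_ticket, int *out)
{
    __shared__ int my_ticket;  // one value per threadblock

    __syncthreads();
    if (threadIdx.x == 0)
        ticket_acquire(&ticket_sem);
    __syncthreads();
    if (threadIdx.x == 0) {
        // critical section: only thread 0 touches the protected variable
        my_ticket = *next_ticket;
        *next_ticket = my_ticket + 1;
    }
    __syncthreads();  // my_ticket is now visible to the whole block
    if (threadIdx.x == 0)
        ticket_release(&ticket_sem);
    __syncthreads();

    // Ordinary block-level communication: every thread reads the shared value.
    out[blockIdx.x * blockDim.x + threadIdx.x] = my_ticket;
}

// Host helper: returns true if each block received a distinct ticket in
// [0, blocks) and all threads of a block agree on that ticket.
bool run_ticket_demo(int blocks, int threads)
{
    int *d_next = nullptr, *d_out = nullptr;
    cudaMalloc(&d_next, sizeof(int));
    cudaMalloc(&d_out, blocks * threads * sizeof(int));
    cudaMemset(d_next, 0, sizeof(int));
    ticket_kernel<<<blocks, threads>>>(d_next, d_out);

    std::vector<int> out(blocks * threads, -1);
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_next);
    cudaFree(d_out);

    std::vector<int> seen(blocks, 0);
    for (int b = 0; b < blocks; ++b) {
        int t = out[b * threads];
        if (t < 0 || t >= blocks) return false;
        for (int i = 1; i < threads; ++i)
            if (out[b * threads + i] != t) return false;
        ++seen[t];
    }
    for (int b = 0; b < blocks; ++b)
        if (seen[b] != 1) return false;
    return true;
}
```

Only one thread per block ever contends for the lock here, so no two threads of the same warp spin against each other.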

Note that this approach is costly in terms of performance, because it serializes the threadblocks that pass through the critical section. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.

Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.
