Cuda 原子更改标志 [英] Cuda atomics change flag

查看:23
本文介绍了Cuda 原子更改标志的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一段串行代码可以做这样的事情

I have a piece of serial code which does something like this

if( ! variable )
{
  do some initialization here 
  variable = true;
}

我知道这在串行中工作得很好,并且只会执行一次.在 CUDA 中,哪种原子操作是正确的?

I understand that this works perfectly fine in serial and will only be executed once. What atomics operation would be the correct one here in CUDA?

推荐答案

在我看来,您想要的是代码中的关键部分".临界区允许一个线程执行一系列指令,同时阻止任何其他线程或线程块执行这些指令.

It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.

临界区可用于控制对内存区域的访问,例如,以允许单个线程无冲突地访问该区域.

A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.

原子本身只能用于对单个变量进行非常有限的、基本上是单一的操作.但是原子可以用来构建临界区.

Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.

您应该在内核中使用以下代码来控制线程对临界区的访问:

You should use the following code in your kernel to control thread access to a critical section:

__syncthreads();
if (threadIdx.x == 0)
  acquire_semaphore(&sem);
__syncthreads();
  //begin critical section
  // ... your critical section code goes here
  //end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
  release_semaphore(&sem);
__syncthreads();

在内核定义这些辅助函数和设备变量之前:

Prior to the kernel define these helper functions and device variable:

__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
  }

__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
  __threadfence();
  }

我已经测试并成功使用了上面的代码.请注意,它本质上使用每个线程块中的线程 0 作为请求者在 线程块 之间进行仲裁.如果您只希望获胜线程块中的一个线程执行临界区代码,您应该进一步调整(例如 if (threadIdx.x < ...))临界区代码.

I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.

在一个信号量的 warp 仲裁中拥有多个线程会带来额外的复杂性,所以我不推荐这种方法.相反,让每个线程块仲裁,就像我在这里展示的那样,然后使用普通的线程块通信/同步方法(例如 __syncthreads()、共享内存等)控制获胜线程块内的行为.

Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)

请注意,这种方法对性能的影响会很大.仅当您无法弄清楚如何以其他方式并行化您的算法时,才应使用临界区.

Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.

最后,警告一下.与任何线程并行架构一样,临界区的不当使用会导致死锁.特别是,对线程块和/或线程块内的扭曲的执行顺序进行假设是一种有缺陷的方法.

Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.

这篇关于Cuda 原子更改标志的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆