CUDA __syncthreads() usage within a warp
Question
If it was absolutely required for all the threads in a block to be at the same point in the code, do we require the __syncthreads function if the number of threads being launched is equal to the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
__shared__ volatile int sdata[16];
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
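For context, the fragment above could sit inside a single-warp kernel along these lines. This is only a sketch: the kernel name is invented, and the expressions standing in for `some_number_between_0_and_15` and `some_number` are arbitrary placeholders, not values from the question.

```cuda
// Sketch of a single-warp kernel built around the fragment above.
// With no __syncthreads(), correctness relies on all threads belonging
// to one warp and executing in lockstep ("warp-synchronous programming").
__global__ void xorKernel(int *output, int x, int y, int z)
{
    __shared__ volatile int sdata[16];  // volatile: force stores to shared memory
    int tid = threadIdx.x;
    int index = 15 - tid;               // placeholder for some_number_between_0_and_15
    sdata[tid] = tid * x;               // placeholder for some_number
    // No barrier here: each thread reads a slot another thread wrote,
    // which is only safe because the whole block is a single warp.
    output[tid] = x ^ y ^ z ^ sdata[index];
}
```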
Answer

Updated with more information about using volatile.
Presumably you want all threads to be at the same point because they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
- Remember that a compiler will assume it can optimise providing the intra-thread semantics remain correct, including delaying stores to memory so that the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read it. Using volatile causes the compiler to perform the memory write rather than keep the data in registers; however, this has some risks and is more of a hack (meaning I don't know how it will be affected in the future).
  - Technically, you should always use __syncthreads() to conform with the CUDA Programming Model.
- The warp size is, and always has been, 32, but you can:
  - At compile time, use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version).
  - At run time, use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual).
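Both ways of obtaining the warp size can be sketched as follows; the kernel name and the choice of device 0 are illustrative, not from the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void showWarpSize()
{
    // warpSize is a built-in variable, available only in device code.
    if (threadIdx.x == 0)
        printf("device-side warpSize = %d\n", warpSize);
}

int main()
{
    // At run time, query the warpSize field of cudaDeviceProp.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0, illustrative
    printf("host-side warpSize = %d\n", prop.warpSize);

    showWarpSize<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```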
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
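The warp-synchronous final stage of the reduction sample follows roughly this pattern (a sketch of the well-known technique, not the SDK source verbatim; note that on GPUs with independent thread scheduling, compute capability 7.0 and later, lockstep execution is no longer guaranteed and __syncwarp() or the __shfl_down_sync intrinsics should be used instead):

```cuda
// Final warp of a block-level sum reduction, done without __syncthreads().
// volatile prevents the compiler from caching sdata elements in registers,
// so each store is visible to the other threads of the warp.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```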