CUDA __syncthreads() usage within a warp


Question


If it is absolutely required for all the threads in a block to be at the same point in the code, do we still need the __syncthreads() function if the number of threads being launched equals the number of threads in a warp?

Note: No extra threads or blocks, just a single warp for the kernel.

Example code:

__shared__ volatile int sdata[16];

int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
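
For reference, here is a minimal self-contained sketch of how this fragment might look as a complete single-warp kernel. The values chosen for x, y, z, the stored number, and the read index are hypothetical stand-ins for the question's placeholders, and the kernel and variable names are mine, not from the question.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void singleWarpKernel(unsigned int *output)
{
    // volatile discourages the compiler from keeping sdata values in registers
    __shared__ volatile unsigned int sdata[16];

    unsigned int tid = threadIdx.x;

    // Hypothetical stand-ins for the question's x, y, z and "some_number"
    unsigned int x = 0xAAAAAAAAu, y = 0x55555555u, z = tid;
    sdata[tid] = tid * 3u;              // "some_number"
    unsigned int index = 15u - tid;     // "some_number_between_0_and_15"

    // No __syncthreads() here: the code relies on all 16 threads being
    // in the same warp and executing together
    output[tid] = x ^ y ^ z ^ sdata[index];
}

int main()
{
    unsigned int *d_out, h_out[16];
    cudaMalloc(&d_out, 16 * sizeof(unsigned int));
    singleWarpKernel<<<1, 16>>>(d_out);     // 16 threads: a single (partial) warp
    cudaMemcpy(h_out, d_out, 16 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; ++i)
        printf("output[%d] = 0x%08x\n", i, h_out[i]);
    cudaFree(d_out);
    return 0;
}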

Solution

Updated with more information about using volatile

Presumably you want all threads to be at the same point because they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.

  • Remember that the compiler will assume it can optimise provided the intra-thread semantics remain correct, including delaying stores to memory so that data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read it. Using volatile causes the compiler to perform the memory write rather than keep the data in registers, but this has some risks and is more of a hack (meaning I don't know how it will be affected in the future). See the first sketch after this list for a side-by-side comparison.
  • Technically, you should always use __syncthreads() to conform with the CUDA Programming Model.
  • The warp size is, and always has been, 32, but you can:
    • At compile time, use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version).
    • At run time, use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual); see the second sketch after this list.
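
As a concrete comparison for the first two points, here is a minimal sketch (kernel names and data are hypothetical, and each block is assumed to be exactly one warp of 32 threads) of the same shared-memory exchange written once with __syncthreads() and once warp-synchronously with volatile:

// Launch both kernels as <<<grid, 32>>> so each block is exactly one warp
__global__ void withBarrier(int *out)
{
    __shared__ int sdata[32];
    int tid = threadIdx.x;
    sdata[tid] = tid;
    __syncthreads();                 // barrier: the store must reach shared memory
    out[tid] = sdata[31 - tid];      // safe under the CUDA Programming Model
}

__global__ void warpSynchronous(int *out)
{
    __shared__ volatile int sdata[32];   // volatile forces the store to memory
    int tid = threadIdx.x;
    sdata[tid] = tid;
    // no barrier: relies entirely on the 32 threads being one warp in lockstep
    out[tid] = sdata[31 - tid];
}

Without volatile, the compiler in the second kernel would be free to keep sdata[tid] in a register, so the reading thread could see stale data.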
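
And a short sketch of the warp-size queries from the last bullet (device 0 is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

// Device code can read the built-in variable warpSize directly
__global__ void printWarpSize()
{
    if (threadIdx.x == 0)
        printf("warpSize (device): %d\n", warpSize);
}

int main()
{
    // Host code queries the same value through cudaDeviceProp
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0 assumed
    printf("warpSize (host): %d\n", prop.warpSize);

    printWarpSize<<<1, 1>>>();
    cudaDeviceSynchronize();                        // flush device-side printf
    return 0;
}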

Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
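
To make that technique concrete, here is a condensed sketch in the spirit of the reduction sample's final stage (the kernel name and the 64-element input size are my assumptions): __syncthreads() is used while more than one warp is active, and the last steps switch to a volatile pointer once only one warp remains.

// Launch as <<<1, 64>>> with 64 input elements
__global__ void reduce64(const int *g_in, int *g_out)
{
    __shared__ int sdata[64];
    int tid = threadIdx.x;
    sdata[tid] = g_in[tid];
    __syncthreads();                     // still needed while two warps are active

    if (tid < 32) {                      // one warp left: warp-synchronous part
        volatile int *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }
    if (tid == 0) *g_out = sdata[0];     // thread 0 holds the total
}

The lockstep assumption this relies on is exactly what the caveats above warn about; on architectures introduced after this answer was written (Volta and later, with independent thread scheduling), warp-synchronous code like this is no longer safe without explicit warp-level synchronization.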
