Why is there a warp-level synchronization primitive in CUDA?

Question

I have two questions regarding __syncwarp() in CUDA:

  1. If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary?
  2. Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?

Answer

If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized?

No. There can be warp-level execution divergence (usually caused by branching, but also by other things such as warp shuffles, voting, and predicated execution), which is handled by instruction replay or execution masking. Note that in "modern" CUDA, implicit warp-synchronous programming is no longer safe; warp-level synchronization is therefore not merely desirable, it is mandatory.
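
As an illustration (a minimal sketch under assumed conditions, not code from the original answer): the classic warp-level tree reduction through shared memory was traditionally written with no barriers at all on pre-Volta GPUs; under the independent thread scheduling introduced with Volta, the lanes of a warp may drift apart, and the __syncwarp() calls below are what make the reduction race-free.

    #include <cstdio>

    // Minimal sketch: sum 32 floats with a warp-level tree reduction.
    // The __syncwarp() calls make this safe under Volta+ independent
    // thread scheduling; the old implicit warp-synchronous version
    // (same loop, no barriers) is a data race on modern GPUs.
    __global__ void reduce32(const float *in, float *out)
    {
        __shared__ float shmem[32];
        unsigned tid = threadIdx.x;          // assumes blockDim.x == 32
        shmem[tid] = in[tid];
        __syncwarp();                        // make every write visible

        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            if (tid < offset)
                shmem[tid] += shmem[tid + offset];
            __syncwarp();                    // reached by all 32 lanes
        }
        if (tid == 0)
            *out = shmem[0];
    }

    int main()
    {
        float h_in[32], h_out = 0.0f;
        for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
        float *d_in, *d_out;
        cudaMalloc(&d_in, sizeof h_in);
        cudaMalloc(&d_out, sizeof h_out);
        cudaMemcpy(d_in, h_in, sizeof h_in, cudaMemcpyHostToDevice);
        reduce32<<<1, 32>>>(d_in, d_out);
        cudaMemcpy(&h_out, d_out, sizeof h_out, cudaMemcpyDeviceToHost);
        printf("sum = %f\n", h_out);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Placing __syncwarp() outside the if is deliberate: a full-mask __syncwarp() must be reached by every lane of the warp, including the lanes that did no work in that step.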

If so, what exactly does __syncwarp() do, and why is it necessary?

Because there can be warp-level execution divergence, and __syncwarp() is how synchronization within a divergent warp is achieved.
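
To make that concrete (a hedged sketch; the kernel and variable names are invented for illustration): when only some lanes of a warp take a branch, the participating lanes can be captured in an explicit member mask with __ballot_sync() while the warp is still converged, and __syncwarp(mask) inside the branch then acts as a barrier for exactly those lanes.

    // Sketch: synchronizing only the lanes of a warp that take a branch.
    __global__ void branchSync(float *buf)
    {
        __shared__ float stage[32];
        unsigned lane   = threadIdx.x & 31;   // assumes blockDim.x == 32
        bool     active = (buf[threadIdx.x] > 0.0f);

        // All 32 lanes vote while still converged; 'mask' names exactly
        // the lanes that will enter the branch below.
        unsigned mask = __ballot_sync(0xffffffffu, active);

        if (active) {
            stage[lane] = buf[threadIdx.x];
            __syncwarp(mask);                 // barrier for the voters only

            // A partner's slot may be read only if that partner is a
            // member of the mask (and therefore wrote its slot above).
            unsigned partner = lane ^ 1;
            if (mask & (1u << partner))
                buf[threadIdx.x] += stage[partner];
        }
    }

The rule to keep in mind: every lane named in the mask must reach the __syncwarp(mask) call, and no lane outside the mask may call it with that mask.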

Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?

By ensuring that the split is always performed explicitly along calculated warp boundaries (or with a suitable thread mask).
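
For instance (again a sketch, assuming the 1024-thread block from the question): if each group is formed from 32 consecutive threads, every group coincides exactly with one hardware warp, and a plain full-mask __syncwarp() is then a correct barrier for the group.

    // Sketch: a 1024-thread block carved into 32 groups of 32 threads,
    // with each group deliberately aligned to a warp boundary.
    __global__ void groupedKernel(const float *in, float *out)
    {
        __shared__ float stage[1024];
        unsigned group = threadIdx.x / 32;    // warp index in the block
        unsigned lane  = threadIdx.x % 32;    // position within the group
        unsigned gidx  = blockIdx.x * blockDim.x + threadIdx.x;

        stage[threadIdx.x] = in[gidx];
        __syncwarp();                         // barrier for this warp/group

        // Group members may now read each other's slots. No other group's
        // data is touched, so no block-wide __syncthreads() is needed.
        out[gidx] = stage[group * 32 + (31 - lane)];  // reverse each group
    }

The cooperative groups API expresses the same idea more directly: cg::tiled_partition<32>(cg::this_thread_block()) yields a 32-thread tile built from consecutive threads, so each tile coincides with a warp and its sync() member is this kind of warp-level barrier.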
