How does a GPU group threads into warps/wavefronts?
Question
My understanding is that a warp is a group of threads defined at runtime by the task scheduler, and that one performance-critical aspect of CUDA is thread divergence within a warp. Is there a way to make a good guess at how the hardware will construct warps within a thread block?
For instance, if I launch a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?
By doing this, one could minimize the divergence of threads within a given warp.
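(For context, here is a minimal CUDA sketch of the kind of divergence in question, assuming a 1D block and the usual warp size of 32; the kernel and buffer names are made up for illustration. A branch on a value that is uniform across a warp stays coherent, while a branch on the lane index forces every warp to execute both paths.)

```
__global__ void branch_demo(float *data)
{
    int tid  = threadIdx.x;            // 1D block assumed
    int warp = tid / 32;               // uniform across a warp
    int lane = tid % 32;               // varies within a warp

    if (warp % 2 == 0)                 // no divergence: all 32 lanes agree
        data[tid] *= 2.0f;

    if (lane % 2 == 0)                 // divergent: each warp runs both paths
        data[tid] += 1.0f;
}
```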
Answer
The thread arrangement inside a warp is implementation dependent, but so far I have always observed the same behavior:
A warp is composed of 32 threads, but the warp scheduler issues one instruction for half a warp (16 threads) at a time.
- If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue one instruction for threadIdx.x = (0..15), (16..31), and so on.
- If you use 2D blocks (both the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will try to issue in this fashion: threadIdx.y = 0, threadIdx.x = (0..15), (16..31), and so on.
So, threads with consecutive threadIdx.x components will execute the same instruction in groups of 16.
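As a rough sketch of this linearization (assuming the standard x-fastest ordering described above and the built-in warpSize of 32; the kernel and output buffer names are made up for the example), each thread's warp can be estimated from its linear index within the block:

```
__global__ void warp_layout(int *warp_id_out)
{
    // Linear index within the block: x varies fastest, then y, then z.
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;

    int warp_id = linear / warpSize;   // which warp this thread falls into
    int lane    = linear % warpSize;   // position within that warp (0..31)

    // Threads sharing warp_id execute in lockstep, so keeping branch
    // conditions uniform over warp_id avoids divergence.
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    warp_id_out[blockIdx.x * threads_per_block + linear] = warp_id;
    (void)lane;                        // lane kept only to show the decomposition
}
```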