How does a GPU group threads into warps/wavefronts?


Question

My understanding is that a warp is a group of threads defined at runtime by the task scheduler. A performance-critical aspect of CUDA is thread divergence within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block?

For instance, if I launch a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell (or at least make a good guess) from the thread index?

Knowing this, one could minimize the divergence of threads within a given warp.

Answer

The thread arrangement inside a warp is implementation dependent, but so far I have always observed the same behavior:

A warp is composed of 32 threads, but the warp scheduler issues one instruction at a time for half a warp (16 threads).

  • If you use 1D blocks (only the threadIdx.x dimension is valid), the warp scheduler issues one instruction for threadIdx.x = (0..15), then (16..31), and so on.

  • If you use 2D blocks (both the threadIdx.x and threadIdx.y dimensions are valid), the warp scheduler tries to issue in this order: for threadIdx.y = 0, threadIdx.x = (0..15), then (16..31), and so on.

So threads with consecutive threadIdx.x values execute the same instruction in groups of 16.

