How are 2D thread blocks padded for warp scheduling?


Question

I understand that a 1D thread block with 31 threads is padded to 32 threads for warp execution. What about a 2D block with 31*31 threads? Will the warp scheduler pad 1 additional thread for each dimension (i.e., a total of 31 will be padded), or will the threads of the 2D block be concatenated so that only the last warp is padded (31*31 = 961; 961 % 32 = 1)?

Recommended answer

Only one warp (the last one) gets padded. Threads are grouped into warps in the order x, y, z. As a result, if you have an odd 2D array size, like 17x17, that is stored contiguously in memory, a 17x17 thread block still yields full 32-thread warps that generate coalesced accesses. Every warp except the last one generates fully coalesced accesses. If individual warps were instead padded with dead threads along the way, memory accesses in this example would be more wasteful.

For this example, at least, it works better from a machine utilization standpoint.
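To put numbers on the utilization claim, here is a minimal host-side C++ sketch that compares the documented packing by linear thread ID with a purely hypothetical scheme in which every row of the block is padded to a multiple of 32. The function names and the row-padded scheme are illustrative assumptions, not anything the hardware or the CUDA API provides.

```cpp
#include <cassert>

// Warp size on NVIDIA GPUs to date; device code would read the built-in warpSize.
constexpr int kWarpSize = 32;

// Warps needed when the threads of a dx-by-dy block are packed by linear
// thread ID, so that only the last warp can be partially filled.
inline int warpsLinear(int dx, int dy) {
    assert(dx > 0 && dy > 0);
    return (dx * dy + kWarpSize - 1) / kWarpSize;  // round up
}

// HYPOTHETICAL scheme for comparison only (not what the hardware does):
// each row of dx threads is padded separately to a multiple of the warp size.
inline int warpsRowPadded(int dx, int dy) {
    assert(dx > 0 && dy > 0);
    return dy * ((dx + kWarpSize - 1) / kWarpSize);
}

// Inactive (padding) lanes for a given warp count and block shape.
inline int paddedLanes(int warps, int dx, int dy) {
    return warps * kWarpSize - dx * dy;
}
```

For a 17x17 block, `warpsLinear(17, 17)` gives 10 warps with 31 padding lanes, all in the last warp, while the hypothetical `warpsRowPadded(17, 17)` would need 17 warps with 255 padding lanes spread across every warp.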

The documentation support for this rests on understanding that thread ID and thread index are not the same.

The thread index for a given thread is identified by the built-in variables threadIdx.x, threadIdx.y, and threadIdx.z. The thread ID is a scalar number assigned to each thread that is unique within the threadblock.

The relationship between thread ID and thread index is given by this statement from the CUDA C Programming Guide:


"The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy)."

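The quoted formula can be written out as a small host-side C++ function. This is a sketch only: the name `linearThreadId` and the plain `int` parameters are assumptions made for illustration, standing in for the built-in `threadIdx` and `blockDim` values a kernel would use.

```cpp
#include <cassert>

// Linear thread ID of index (x, y, z) in a block of size (dx, dy, dz),
// following the formula x + y*Dx + z*Dx*Dy. The block's z extent does not
// appear in the formula, so it is not a parameter. For a 1D block pass
// y = z = 0; for a 2D block pass z = 0.
inline int linearThreadId(int x, int y, int z, int dx, int dy) {
    assert(x >= 0 && x < dx && y >= 0 && y < dy && z >= 0);
    return x + y * dx + z * dx * dy;
}
```

In a 17x17 block, `linearThreadId(0, 1, 0, 17, 17)` is 17, so the first thread of row 1 directly follows the last thread of row 0. Inside a kernel, the same value would be computed as `threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y`.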
But the grouping of threads into warps is done explicitly by thread ID (see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture):


"The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0."

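Taken together with the thread ID formula, the partitioning rule means that a thread's warp follows from integer division of its thread ID by the warp size. The host-side C++ sketch below illustrates this for a 2D block. The `WarpSlot` struct and `warpSlot` function are hypothetical names, and taking the lane as the remainder is the conventional computation rather than something the quoted sentence spells out.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // warp size on NVIDIA GPUs to date

// Hypothetical helper type: which warp a thread falls in, and its lane there.
struct WarpSlot {
    int warp;
    int lane;
};

// Warp and lane of index (x, y) in a 2D block whose x-dimension is dx.
inline WarpSlot warpSlot(int x, int y, int dx) {
    assert(x >= 0 && x < dx && y >= 0);
    int tid = x + y * dx;  // linear thread ID
    return WarpSlot{tid / kWarpSize, tid % kWarpSize};
}
```

In a 17x17 block, index (14, 1) has thread ID 31 and is the last lane of warp 0, while index (15, 1) has thread ID 32 and is lane 0 of warp 1. A warp therefore spans row boundaries rather than stopping at the end of a row.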
So based on the first statement, we see that even for an odd block shape like 17x17, no threads are defined other than those within the dimensions of the threadblock. Then, based on the second statement, assembling warps from consecutive thread IDs produces warps that are completely filled with defined threads (except perhaps the last one).
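As a check on this conclusion, the following host-side C++ sketch lists how many defined threads each warp of a 2D block holds when warps are filled from consecutive thread IDs. The function name `activeLanesPerWarp` is an assumption for illustration and is not part of any CUDA API.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // warp size on NVIDIA GPUs to date

// Number of defined threads in each warp of a dx-by-dy block, assuming warps
// are filled from consecutive thread IDs starting at thread 0.
inline std::vector<int> activeLanesPerWarp(int dx, int dy) {
    assert(dx > 0 && dy > 0);
    const int threads = dx * dy;
    std::vector<int> active;
    for (int first = 0; first < threads; first += kWarpSize) {
        const int remaining = threads - first;
        active.push_back(remaining < kWarpSize ? remaining : kWarpSize);
    }
    return active;
}
```

For a 17x17 block this yields ten entries: nine warps with 32 defined threads each and a last warp with a single defined thread. A 31x31 block gives 31 warps, again with only one defined thread in the last.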
