What's the most efficient way to calculate the warp id / lane id in a 1-D grid?


Question

In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:

  • its index as a lane within its warp (its "lane ID")
  • the index of its warp within the block (its "warp ID")

Assuming the grid is 1-dimensional (a.k.a. linear, i.e. blockDim.y and blockDim.z are 1), one can obviously obtain these as follows:

enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;

and if you don't trust the compiler to optimize that, you could rewrite it as:

enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;

Is that the most efficient thing to do? It still seems like a lot of waste for every thread to have to compute this.

(Inspired by this question.)

Answer

tl;dr: The naive computation is the most efficient one, for now.

Note: this answer has been heavily edited.

It is very tempting to try and avoid the computation altogether - as these two values seem to already be available if you look under the hood.

You see, nVIDIA GPUs have special registers which your (compiled) code can read to access various kinds of useful information. One such register holds threadIdx.x; another holds blockDim.x; another holds the clock tick count; and so on. C++ as a language does not expose these, obviously; and, in fact, neither does CUDA. However, the intermediate representation into which CUDA code is compiled, named PTX, does expose these special registers (since PTX 1.3, i.e. with CUDA versions >= 2.1).

Two of these special registers are %warpid and %laneid. Now, CUDA supports inlining PTX code within CUDA code with the asm keyword - just like it can be used for host-side code to emit CPU assembly instructions directly. With this mechanism one can use these special registers:

__forceinline__ __device__ unsigned lane_id()
{
    unsigned ret;
    // note the doubled "%%": in inline PTX, "%" must be escaped
    // so it is not taken for an operand placeholder
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned warp_id()
{
    // this is not equal to threadIdx.x / 32!
    unsigned ret;
    asm volatile ("mov.u32 %0, %%warpid;" : "=r"(ret));
    return ret;
}

... but there are two problems here.

The first problem - as @Patwie suggests - is that %warpid does not give you what you actually want: it is not the index of the warp in the context of the grid, but rather in the context of the physical SM (which can only have so many warps resident at a time), and those two are not the same. So don't use %warpid.

As for %laneid, it does give you the correct value, but it is misleadingly non-performant: even though it's a "register", it's not like the regular registers in your register file, with 1-cycle access latency. It's a special register, which in the actual hardware is retrieved using an S2R instruction, and that can exhibit long latency.


Bottom line: just compute the warp ID and lane ID from the thread ID. We can't get around this, for now.
