In a CUDA kernel, how do I store an array in "local thread memory"?


Question


I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within the local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least 16KB and since my arrays are just 52 chars long, is there any way (syntax please :) ) to store them in local memory?

Shouldn't it be something like:

__global__ void my_kernel(int a)
{
  __local__ unsigned char p[50];
}

Answer



Arrays, local memory and registers

There is a misconception here regarding the definition of "local memory". "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together). If you want things to be really fast, you want to use either shared memory, or, better yet, registers (especially on the latest devices where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy is beyond the scope of this post. Let us instead focus on making small array computations fast.
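As a small illustration of the shared-memory option mentioned above, here is a sketch of a per-block staging buffer. The kernel name, the 256-byte tile size, and the matching launch configuration (256 threads per block) are assumptions made for this example, not something from the original post:

```cuda
// Sketch only: each thread copies one byte into on-chip shared memory,
// the block works on it there, and writes it back to global memory.
__global__ void sharedMemorySketch(const unsigned char *aIn, unsigned char *aOut)
{
    __shared__ unsigned char tile[256];      // one byte per thread in the block

    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = aIn[globalIdx];      // global -> shared
    __syncthreads();                         // wait until the whole tile is loaded

    // ... work on tile[] here; shared memory is on-chip and much faster
    //     than (thread-local) global memory ...

    aOut[globalIdx] = tile[threadIdx.x];     // shared -> global
}
```

Shared memory is shared per block, so this only helps when threads of the same block reuse or exchange the data; for purely per-thread scratch data, registers (discussed next) are the faster option.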


Small arrays, just like variables, can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don't do it exactly right, your data will end up in local memory (which, again, is really global memory, which is the slowest memory you have). The CUDA Programming Guide, section 5.3.2, tells you when local memory is used:

Local memory


Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:

  1. Arrays for which it cannot determine that they are indexed with constant quantities,
  2. Large structures or arrays that would consume too much register space,
  3. Any variable if the kernel uses more registers than available (this is also known as register spilling).



How does register allocation work?

Note that register allocation is an extremely complicated process, which is why you cannot (and should not) interfere with it. Instead, the compiler will convert CUDA code into PTX code (a sort of bytecode), which assumes a machine with infinitely many registers. You can write inline PTX, but it won't affect register allocation much. PTX code is device-independent and it is only the first stage. In a second stage, PTX is compiled into device assembly code, called SASS. SASS code has the actual register allocations. The SASS compiler and its optimizer are also the ultimate authority on whether a variable will be in registers or local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that to your advantage. The code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there are no guarantees as to what will or will not be in registers.


Appendix G, section 1, tells you how many registers a thread can have. Look for "Maximum number of 32-bit registers per thread". In order to interpret that table, you must know your compute capability (see below). Don't forget that registers are used for all kinds of things and don't just correlate to single variables. Registers on all devices up to CC 3.5 are 32 bits each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can, for example, pack multiple bytes into the same register. The Nsight code correlation view (see "Analyzing memory accesses" below) also reveals that.


While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overlooked is the fact that, on current hardware (Compute Capability 3.x and below), the compiler places in local memory any array that is accessed with dynamic indexing. A dynamic index is an index which the compiler cannot figure out. Arrays accessed with dynamic indices can't be placed in registers because registers must be determined by the compiler, and thus the actual register being used must not depend on a value determined at run time. For example, given an array arr, arr[k] is constant indexing if and only if k is a constant, or only depends on constants. If k, in any way, depends on some non-constant value, the compiler cannot compute the value of k and you get dynamic indexing. In loops where k starts and ends at (small) constant values, the compiler (most probably) can unroll your loop and still achieve constant indexing.
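To make the distinction concrete, here is a minimal sketch (the kernel and variable names are made up for this example) contrasting the two kinds of indexing in one kernel:

```cuda
__global__ void indexingSketch(const int *aIn, int *aOut, int k)
{
    int arr[8];

    // Constant indexing: the trip count is a compile-time constant, so the
    // compiler can unroll the loop and map every arr[i] to a fixed register.
    #pragma unroll
    for (int i = 0; i < 8; ++i)
    {
        arr[i] = aIn[i];
    }

    // Dynamic indexing: k is only known at run time, so the compiler cannot
    // map arr[k] to one specific register -- arr will (very likely) be
    // demoted to local memory because of this single access.
    aOut[0] = arr[k];
}
```

Whether the demotion actually happens depends on the compiler version and target architecture, so verify with the tools described below rather than relying on this sketch.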


For example, sorting a small array can be done in registers but you must use sorting networks or similarly "hard-wired" approaches, and can't just use a standard algorithm because most algorithms use dynamic indexing.


With quite high probability, in the following code example, the compiler keeps the entire aBytes array in registers because it is not too large and the loops can be fully unrolled (because the loop iterates over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep it fully in registers. Keep in mind that there are no guarantees. The best you can do is to verify it on a case-by-case basis using the CUDA developer tools, as described below.

// Helper not shown in the original post; a minimal definition for completeness.
// Swaps two array elements if they are out of order. Note that i and j are
// always compile-time constants at the call sites below (constant indexing).
__device__ void CompareAndSwap(char * aBytes, int i, int j)
{
    if (aBytes[i] > aBytes[j])
    {
        char tmp  = aBytes[i];
        aBytes[i] = aBytes[j];
        aBytes[j] = tmp;
    }
}

__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
    const int NBytes = 4;

    char aBytes[NBytes];

    // copy input to local array
    for (int i = 0; i < NBytes; ++i)
    {
        aBytes[i] = aInput[i];
    }

    // sort using a sorting network (all indices are compile-time constants)
    CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3);
    CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3);
    CompareAndSwap(aBytes, 1, 2);

    // copy back to result array
    for (int i = 0; i < NBytes; ++i)
    {
        aResult[i] = aBytes[i];
    }
}

Analyzing memory accesses

Once you are done, you typically want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is tell your compiler to report memory statistics using the --ptxas-options=-v flag. A more detailed way of analyzing memory accesses is using Nsight.


Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.


If you follow the instructions in the link above (make sure to add the corresponding flags when compiling!), you can find the "CUDA Memory Transactions" button at the very bottom of the lower menu. In that view, you want to find that there is no memory transaction coming from the lines that only work on the corresponding array (e.g. the CompareAndSwap lines in my code example). If it does not report any memory access for those lines, you (very probably) were able to keep the entire computation in registers and might just have gained a speedup of thousands, if not tens of thousands, of percent (you will also want to check the actual speed gain you get out of this!).


In order to figure out how many registers you have, you need to know your device's compute capability. The standard way of getting such device information is running the device query sample. For CUDA 5.5 on 64-bit Windows, that is by default located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\Bin\win64\Release\deviceQuery.exe (on Windows, the console window will close immediately; you might want to open cmd first and run it from there). It has a similar location on Linux and Mac.
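If you prefer to query the compute capability programmatically rather than running the deviceQuery sample, the CUDA runtime API exposes it via cudaGetDeviceProperties. A minimal host-side sketch (error handling reduced to a bare check):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)  // device 0
    {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    // prop.major / prop.minor form the compute capability (e.g. 3.5);
    // prop.regsPerBlock is the number of 32-bit registers available per block.
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```

Compile with nvcc; with the compute capability in hand you can look up the per-thread register limit in the table mentioned above.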


If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.


I am sharing this today because I came across this very problem very recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.

