In a CUDA kernel, how do I store an array in "local thread memory"?


Question

I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least 16KB, and since my arrays are only about 52 chars long, is there any way (syntax please :) ) to store them in local memory?

Shouldn't it be something like:

__global__ void my_kernel(int a)
{
  __local__ unsigned char p[50];
}

Solution

Arrays, local memory and registers

There is a misconception here regarding the definition of "local memory". "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together). If you want things to be really fast, you want to use either shared memory or, better yet, registers (especially on the latest devices, where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy is beyond the scope of this post. Let us instead focus on making small array computations fast.
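
For the question's concrete case (a ~52-byte scratch array per thread), a shared-memory version could look like the sketch below. This is a minimal sketch of my own, not part of the original post: BLOCK_SIZE is an assumed compile-time block size, and because shared memory is shared by the whole thread block, each thread works on its own slice:

#define BLOCK_SIZE 128  // assumed block size, must match the launch configuration

__global__ void my_kernel(int a)
{
    // one 52-byte slice of the block's shared memory per thread
    __shared__ unsigned char p[BLOCK_SIZE][52];
    unsigned char * myP = p[threadIdx.x];

    myP[0] = (unsigned char)a;  // example access; work on myP[0..51] here
}

Keep in mind that shared memory is a per-block resource, so BLOCK_SIZE * 52 bytes must fit within the per-block shared memory limit of your device.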

Small arrays, just like variables, can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don't do it exactly right, your data will end up in local memory (which, again, is really global memory, the slowest memory you have). The CUDA Programming Guide, section 5.3.2, tells you when local memory is used:

Local Memory

Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:

  1. Arrays for which it cannot determine that they are indexed with constant quantities,
  2. Large structures or arrays that would consume too much register space,
  3. Any variable if the kernel uses more registers than available (this is also known as register spilling).
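
As a concrete illustration of case 2, here is a minimal sketch of my own (the array size and the running sum are chosen only to force the effect): an array this large cannot possibly fit in the register file, so the compiler places it in local memory.

__global__ void likelySpills(const float * aIn, float * aOut)
{
    // 1024 floats = 4KB per thread, far more register space than any
    // current device offers, so this array ends up in local memory
    float big[1024];

    for (int i = 0; i < 1024; ++i)
    {
        big[i] = aIn[i] * 2.0f;
    }

    // reduce to a single value so the array is not optimized away
    float sum = 0.0f;
    for (int i = 0; i < 1024; ++i)
    {
        sum += big[i];
    }

    aOut[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}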

How does register allocation work?

Note that register allocation is an extremely complicated process, which is why you cannot (and should not) interfere with it. Instead, the compiler converts CUDA code into PTX code (a sort of bytecode), which assumes a machine with infinitely many registers. You can write inline PTX, but it won't give you much control over register allocation. PTX code is device-independent code, and it is only the first stage. In a second stage, PTX is compiled into device assembly code, called SASS. SASS code has the actual register allocations. The SASS compiler and its optimizer are also the ultimate authority on whether a variable will be in registers or in local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that to your advantage. The code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there are no guarantees as to what will or will not be in registers.

Insufficient registers

Appendix G, section 1 tells you how many registers a thread can have. Look for "Maximum number of 32-bit registers per thread". In order to interpret that table, you must know your compute capability (see below). Don't forget that registers are used for all kinds of things, and don't just correlate to single variables. Registers on all devices up to CC 3.5 are 32 bits each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can, for example, pack multiple bytes into the same register. The Nsight code correlation view (see "Analyzing memory accesses" below) also reveals this.

Constant vs. Dynamic Indexing

While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overlooked is the fact that, on current hardware (Compute Capability 3.x and below), the compiler places in local memory any array that is accessed with dynamic indexing. A dynamic index is an index which the compiler cannot figure out. Arrays accessed with dynamic indices can't be placed in registers because registers must be determined by the compiler, and thus the actual register being used must not depend on a value determined at run time. For example, given an array arr, arr[k] is constant indexing if and only if k is a constant or only depends on constants. If k, in any way, depends on some non-constant value, the compiler cannot compute the value of k and you have dynamic indexing. In loops where k starts and ends at (small) constant numbers, the compiler (most probably) can unroll your loop and still achieve constant indexing.
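
To make the distinction concrete, here is a minimal sketch of my own (the kernel and its parameters are invented for illustration): the first loop uses only compile-time-known indices after unrolling, while the last line indexes with a run-time value.

__global__ void indexingExample(const int * aIn, int * aOut, int aK)
{
    int arr[4];

    // constant indexing: the loop covers a fixed, small range, so the
    // compiler can unroll it; every index is then known at compile time
    for (int i = 0; i < 4; ++i)
    {
        arr[i] = aIn[i];
    }

    // dynamic indexing: aK is only known at run time, so the compiler
    // cannot map arr onto fixed registers and (very likely) moves the
    // whole array to local memory
    aOut[0] = arr[aK & 3];
}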

Example

For example, sorting a small array can be done in registers but you must use sorting networks or similarly "hard-wired" approaches, and can't just use a standard algorithm because most algorithms use dynamic indexing.

In the following code example, the compiler (with quite high probability) keeps the entire aBytes array in registers, because it is not too large and the loops can be fully unrolled (they iterate over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep the array fully in registers. Keep in mind that there are no guarantees. The best you can do is to verify it on a case-by-case basis using the CUDA developer tools, as described below.
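
The example relies on a CompareAndSwap helper that is not part of the original post; a minimal sketch of such a helper (my own, sorting in ascending order) could look like this. Note that i and j are compile-time constants at every call site, so after inlining, the accesses remain constant indexing:

__device__ __forceinline__
void CompareAndSwap(char * aBytes, int i, int j)
{
    char a = aBytes[i];
    char b = aBytes[j];

    // put the smaller value at the lower index, branch-free
    aBytes[i] = (a < b) ? a : b;
    aBytes[j] = (a < b) ? b : a;
}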

__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
    const int NBytes = 4;

    char aBytes[NBytes];

    // copy input to local array
    for (int i = 0; i < NBytes; ++i)
    {
        aBytes[i] = aInput[i];
    }

    // sort using sorting network
    CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3); 
    CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3); 
    CompareAndSwap(aBytes, 1, 2); 


    // copy back to result array
    for (int i = 0; i < NBytes; ++i)
    {
        aResult[i] = aBytes[i];
    }
}

Analyzing memory accesses

Once you are done, you generally want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is to tell your compiler to give you memory statistics using the --ptxas-options=-v flag. A more detailed way of analyzing memory accesses is using Nsight.

Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.

If you follow the instructions in the link above (make sure to add the corresponding flags when compiling!), you can find the "CUDA Memory Transactions" button at the very bottom of the lower menu. In that view, you want to see that there are no memory transactions coming from the lines that only work on the corresponding array (e.g. the CompareAndSwap lines in my code example). If no memory accesses are reported for those lines, you (very probably) were able to keep the entire computation in registers and may just have gained a speedup of thousands, if not tens of thousands, of percent (you might also want to check the actual speed gain you get out of this!).

Figuring out Compute Capability

In order to figure out how many registers you have, you need to know your device's compute capability. The standard way of getting such device information is running the device query sample. For CUDA 5.5 on 64-bit Windows, that is by default located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\Bin\win64\Release\deviceQuery.exe (on Windows, the console window will close immediately, so you might want to open cmd first and run it from there). It is in a similar location on Linux and Mac.
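
Alternatively, you can query the compute capability programmatically through the CUDA runtime API. A minimal sketch (error handling omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("%s: compute capability %d.%d, %d registers per block\n",
           prop.name, prop.major, prop.minor, prop.regsPerBlock);
    return 0;
}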

If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.

Don't optimize early

I am sharing this today because I came across this very problem recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.
