In a CUDA kernel, how do I store an array in "local thread memory"?


Problem description


I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within the local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least of 16KB and since my arrays are just like 52 chars long, is there any way (syntax please :) ) to store them in local memory?

Shouldn't it be something like:

__global__ my_kernel(int a)
{
  __local__ unsigned char p[50];
}

Solution

Arrays, local memory and registers

There is a misconception here regarding the definition of "local memory". "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together). If you want things to be really fast, you want to use either shared memory or, better yet, registers (especially on the latest devices, where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy is beyond the scope of this post. Let us instead focus on making small array computations fast.

Small arrays, just like variables, can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don't do it exactly right, your data will end up in local memory (which, again, is really global memory, the slowest memory you have). The CUDA Programming Guide, section 5.3.2, tells you when local memory is used:

Local Memory

Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:

  1. Arrays for which it cannot determine that they are indexed with constant quantities,
  2. Large structures or arrays that would consume too much register space,
  3. Any variable if the kernel uses more registers than available (this is also known as register spilling).
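
The three cases above can be illustrated in a single kernel. This is a hedged sketch (the kernel and variable names are made up, and whether a given array actually spills always depends on the compiler version and target architecture):

```cuda
// Sketch of the three spill causes (illustrative only).
__global__ void spillExamples(const int * aIn, int * aOut, int k)
{
    // 1. Dynamic indexing: k is a runtime value, so the compiler cannot
    //    map each element of 'small' to a fixed register -> local memory.
    int small[4] = { aIn[0], aIn[1], aIn[2], aIn[3] };
    aOut[0] = small[k & 3];

    // 2. Too large: even with constant indexing, 1024 ints cannot fit
    //    into the per-thread register budget -> local memory.
    int big[1024];
    for (int i = 0; i < 1024; ++i)
        big[i] = aIn[i];
    aOut[1] = big[0] + big[1023];

    // 3. Register spilling: if the kernel as a whole needs more registers
    //    than the device provides per thread, the compiler spills some
    //    variables (not necessarily arrays) to local memory.
}
```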

How does register allocation work?

Note that register allocation is an extremely complicated process, which is why you cannot (and should not) interfere with it. Instead, the compiler converts CUDA code into PTX code (a sort of bytecode), which assumes a machine with infinitely many registers. You can write inline PTX, but it won't do much to influence register allocation. PTX code is device-independent, and it is only the first stage. In a second stage, PTX is compiled into device assembly code, called SASS. The SASS code has the actual register allocations. The SASS compiler and its optimizer are also the ultimate authority on whether a variable will live in registers or in local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that to your advantage. The code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there are no guarantees as to what will or will not be in registers.
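
You can inspect both stages yourself. A sketch of the usual commands (file names are placeholders; the `-arch` value should match your device):

```shell
# Stage 1: emit the device-independent PTX for inspection
nvcc -ptx my_kernel.cu -o my_kernel.ptx

# Stage 2: compile for a concrete architecture, then dump the SASS
# (which contains the actual register assignments)
nvcc -arch=sm_35 -cubin my_kernel.cu -o my_kernel.cubin
cuobjdump -sass my_kernel.cubin
```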

Insufficient registers

Appendix G, section 1 tells you how many registers a thread can have. Look for "Maximum number of 32-bit registers per thread". To interpret that table, you must know your compute capability (see below). Don't forget that registers are used for all kinds of things and don't just correlate to single variables. Registers on all devices up to CC 3.5 are 32 bits each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can, for example, pack multiple bytes into the same register. The Nsight code correlation view (see "Analyzing memory accesses" below) also reveals that.

Constant vs. Dynamic Indexing

While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overlooked is that, on current hardware (Compute Capability 3.x and below), the compiler places any array in local memory that is accessed with dynamic indexing. A dynamic index is an index the compiler cannot figure out. Arrays accessed with dynamic indices can't be placed in registers because registers must be determined by the compiler, so the actual register being used must not depend on a value determined at run time. For example, given an array arr, arr[k] uses constant indexing if and only if k is a constant or depends only on constants. If k depends in any way on some non-constant value, the compiler cannot compute the value of k and you get dynamic indexing. In loops where k starts and ends at (small) constant numbers, the compiler can (most probably) unroll the loop and still achieve constant indexing.
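
The distinction can be sketched in a few lines. This is an illustrative kernel (names are made up), not a guarantee of what any particular compiler will do:

```cuda
// Constant vs. dynamic indexing (sketch).
__global__ void indexingExamples(const int * aIn, int * aOut, int k)
{
    int arr[4] = { aIn[0], aIn[1], aIn[2], aIn[3] };

    // Constant indexing: the index is a literal, so arr[2] can live
    // in a register.
    aOut[0] = arr[2];

    // Still constant indexing: the loop runs over a constant range, so
    // the compiler can unroll it into arr[0], arr[1], arr[2], arr[3].
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += arr[i];
    aOut[1] = sum;

    // Dynamic indexing: k is only known at run time, so the compiler
    // (very probably) moves the whole of 'arr' to local memory.
    aOut[2] = arr[k & 3];
}
```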

Example

For example, sorting a small array can be done in registers but you must use sorting networks or similarly "hard-wired" approaches, and can't just use a standard algorithm because most algorithms use dynamic indexing.

With quite a high probability, in the following code example, the compiler keeps the entire aBytes array in registers because it is not too large and the loops can be fully unrolled (they iterate over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep the array fully in registers. Keep in mind that there are no guarantees. The best you can do is verify it on a case-by-case basis using the CUDA developer tools, as described below.

__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
    const int NBytes = 4;

    char aBytes[NBytes];

    // copy input to local array
    for (int i = 0; i < NBytes; ++i)
    {
        aBytes[i] = aInput[i];
    }

    // sort using sorting network
    CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3); 
    CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3); 
    CompareAndSwap(aBytes, 1, 2); 


    // copy back to result array
    for (int i = 0; i < NBytes; ++i)
    {
        aResult[i] = aBytes[i];
    }
}

Analyzing memory accesses

Once you are done, you generally want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is to tell your compiler to give you memory statistics using the --ptxas-options=-v flag. A more detailed way of analyzing memory accesses is using Nsight.
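
For example (the output format below is typical for this flag, but the exact numbers and wording vary by toolkit version and kernel):

```shell
nvcc --ptxas-options=-v -arch=sm_35 -c my_kernel.cu
# ptxas info    : Compiling entry function '...' for 'sm_35'
# ptxas info    : Used 8 registers, 336 bytes cmem[0]
# Lines reporting "N bytes lmem" or non-zero "spill stores" /
# "spill loads" would indicate that data went to local memory.
```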

Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.

If you follow the instructions in the above link (make sure to add the corresponding flags when compiling!), you can find the "CUDA Memory Transactions" button at the very bottom of the lower menu. In that view, you want to find that no memory transactions come from the lines that only work on the corresponding array (e.g. the CompareAndSwap lines in my code example). If it does not report any memory access for those lines, you (very probably) were able to keep the entire computation in registers and might just have gained a speed-up of thousands, if not tens of thousands, of percent (you might also want to measure the actual speed gain you get out of this!).

Figuring out Compute Capability

In order to figure out how many registers you have, you need to know your device's compute capability. The standard way of getting such device information is running the device query sample. For CUDA 5.5 on 64-bit Windows, it is by default located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\Bin\win64\Release\deviceQuery.exe (on Windows the console window will close immediately; you might want to open cmd first and run it from there). It has a similar location on Linux and Mac.
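
Alternatively, you can query the compute capability programmatically. A minimal sketch using the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    {
        printf("No CUDA device found\n");
        return 1;
    }
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```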

If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.

Don't optimize early

I am sharing this today because I came across this very problem very recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.
