Achieving the equivalent of a variable-length (local) array in CUDA

Question

I have some code which uses local memory (I might have used registers, but I need dynamic addressing). Since the amount of memory I use depends on the input and on the number of threads in the block (which also depends on the input, at run-time, although before launch-time) - it can't be a fixed-size array. On the other hand, I can't write

__global__ void foo(short x)
{
    int my_local_mem_array[x];
}

(This is valid, if problematic, C99.)

How can I achieve the same effect? So far my thought has been to template the kernel on the memory size and just invoke it with the maximum possible proper-L1 memory on a core, using only as much as I need. But that's kind of ugly, since it would mean I would have to multiply the number of instantiations by the different possible maximum memory sizes. Ugh.

Recommended answer

I think template metaprogramming is probably the only realistic way of doing what it seems you want (the rationale for why you actually want to do this isn't very obvious, but that is another question). There isn't any other way I am aware of for declaring a "variable" length local memory array, given that local memory requires static compilation as part of the per thread stack frame.

Of course, instantiating and selecting many different versions of the same template function isn't much fun, but you can use something like boost preprocessor to automate all the tedium.

For example, consider the following simple kernel which looks a lot like the model you describe in your question:

#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>

template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
       out[threadIdx.x] = scratch[Nout];
    }
}

#define PRED(r, state) \
   BOOST_PP_NOT_EQUAL( \
      BOOST_PP_TUPLE_ELEM(2, 0, state), \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) \
   ) \
   /**/

#define OP(r, state) \
   ( \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), \
      BOOST_PP_TUPLE_ELEM(2, 1, state) \
   ) \
   /**/

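// STUB emits one explicit instantiation; it already ends with a semicolon,
// so MACRO must not append a second one.
#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state))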

BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>

Here I have used BOOST_PP_FOR to generate 11 different instances of the basic kernel (kernel<10> through kernel<20>) automatically, as the ptxas output from compiling it shows:

>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..\boost_1_60_0 template.cu

template.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi13EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi17EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi15EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi19EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi11EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi16EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi20EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi12EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi14EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi18EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi10EEvPii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]

You can also use the same automation to generate a host wrapper function which selects the correct instance at runtime. While it isn't ideal, it is portable, fast and works fine with the CUDA toolchain.
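
As a rough sketch of what such a host wrapper might look like, the following uses BOOST_PP_REPEAT_FROM_TO to generate one switch case per array size. The names launch_kernel and LAUNCH_CASE, the blocks/threads parameters, and the extra Nout < N bound in the kernel are assumptions made for illustration; they are not part of the original answer.

```cuda
#include <cuda_runtime.h>
#include <boost/preprocessor/repetition/repeat_from_to.hpp>

// Same shape as the kernel above. The extra "Nout < N" bound is an addition
// for this sketch; it keeps the read inside scratch[].
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for (int i = 0; i < N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1 && Nout < N) {
        out[threadIdx.x] = scratch[Nout];
    }
}

// Expands to one switch case per array size n (hypothetical helper macro).
// Launching kernel<n> here instantiates it implicitly.
#define LAUNCH_CASE(z, n, unused) \
    case n: kernel<n><<<blocks, threads>>>(out, Nout); break;

// Hypothetical host wrapper: selects the instance matching the runtime size.
// Returns false if no instance exists for `size` or if the launch fails.
bool launch_kernel(int size, int *out, int Nout, int blocks, int threads)
{
    switch (size) {
        BOOST_PP_REPEAT_FROM_TO(10, 21, LAUNCH_CASE, ~)  // sizes 10..20
        default: return false;
    }
    return cudaGetLastError() == cudaSuccess;
}
```

A caller would then write something like launch_kernel(16, d_out, 3, 1, 32), where d_out is a device buffer with at least 32 ints. Any size outside 10..20 is rejected before a launch is attempted, so the caller can fall back to another code path.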
