Using both dynamically-allocated and statically-allocated shared memory


Question



Suppose I have two __device__ CUDA functions, each having the following local variable:

__shared__ int a[123];

and another function (say it's my kernel, i.e. a __global__ function), with:

extern __shared__ int b[];

Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in section B.2.3 of the programming guide, on __shared__.) Do the sizes all count together towards the shared memory limit, or does only the maximum possibly in use at a single time count? Or some other rule?

This can be considered a follow-up question to this one.

Solution

The shared memory is split into two parts: statically allocated and dynamically allocated. The first part is computed during compilation, and each declaration is an actual allocation; enabling ptxas info during compilation illustrates it:

  ptxas info    : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]

Here, we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).
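For reference, a report like the one above can be obtained by asking nvcc to pass the verbose flag down to ptxas (the file name `shared.cu` is just a placeholder):

```shell
# Have ptxas print per-kernel register, shared-memory and constant-memory usage.
nvcc -Xptxas -v shared.cu -o shared

# Equivalent long form:
nvcc --ptxas-options=-v shared.cu -o shared
```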

Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access another function's shared memory declaration.

Then comes the dynamically-allocated shared memory, whose reserved size is declared at kernel launch.

Here is an example of some various uses in a couple of functions. Note the pointer value of each shared memory region.

#include <cstdio>

__device__ void dev1()
{
    // Statically-allocated shared memory, local to this function
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;

    if (threadIdx.x == 0)
        printf("dev1     : %p\n", (void*)a);
}

__device__ void dev2()
{
    // A second static allocation, distinct from dev1's array
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;

    if (threadIdx.x == 0)
        printf("dev2     : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];          // static allocation in the kernel itself
    extern __shared__ int b[];     // dynamic allocation, sized at launch

    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;

    dev1();
    __syncthreads();
    dev2();
    __syncthreads();

    res[threadIdx.x]  = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];

    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;

    cudaMalloc(&dres,  32 * sizeof(int));
    cudaMalloc(&dres2, 32 * sizeof(int));

    // The third launch-configuration argument reserves the dynamic shared memory
    kernel<<<1, 32, 32 * sizeof(int)>>>(dres, dres2);

    int hres[32];
    int hres2[32];

    cudaMemcpy(hres,  dres,  32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);

    for (int k = 0; k < 32; ++k)
        printf("%d -- %d\n", hres[k], hres2[k]);
    return 0;
}

The ptxas info for this code reports 384 bytes smem: one array for the kernel's own a, a second for dev1's a, and a third for dev2's a, totalling 3 * 32 * sizeof(int) = 384 bytes.

When running the kernel with dynamic shared memory equal to 32 * sizeof(int), the pointer b starts right after these three arrays.

EDIT: The PTX file generated by this code contains the declarations of the statically-defined shared memory,

.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];

except for the entry point's array, which is defined in the body of the kernel:

// _ZZ6kernelPiS_E1a has been demoted

The shared state space is defined in the PTX documentation here:

The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.

There is, though, no detail on the runtime behavior. The programming guide has a word on dynamic shared memory here, with no further detail on mixing the two.

During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use the SR_LMEMHIOFF special register

IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;

and the called functions are, in reverse order, assigned different offsets in the statically-allocated shared memory (it looks very much like a form of stackalloc).

I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when every method may be called (when one of the methods is not used and function pointers are involved, the b address does not change, and the unallocated shared-memory region is simply never accessed).

Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.
