Using both dynamically-allocated and statically-allocated shared memory
Problem description

Suppose I have two __device__ CUDA functions, each having the following local variable:

__shared__ int a[123];

and another function (say it's my kernel, i.e. a __global__ function), with:

extern __shared__ int b[];

Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide, section B.2.3 on __shared__.) Do the sizes all count together towards the shared memory limit, or is it the maximum possibly in use at a single time? Or some other rule?

This can be considered a follow-up question to this one.
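(For concreteness, here is a minimal sketch of the combination being asked about — the sizes are arbitrary and the helper name is made up:)

__device__ void helper()
{
    // statically-allocated shared memory, declared inside a device function
    __shared__ int a[123];
    a[threadIdx.x % 123] = threadIdx.x;
}

__global__ void kern()
{
    // dynamically-allocated shared memory, sized at launch time:
    // kern<<<blocks, threads, nBytes>>>();
    extern __shared__ int b[];
    b[threadIdx.x] = threadIdx.x;
    helper();
}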
The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation - activating the ptxas info during compilation illustrates it:

ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]

Here, we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below). Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access another shared memory declaration.

Then comes the dynamically allocated shared memory, whose reserved size is declared during the kernel call.

Here is an example of some various uses in a couple of functions. Note the pointer value of each shared memory region.
#include <cstdio>
#include <cuda_runtime.h>

__device__ void dev1()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;
    if (threadIdx.x == 0)
        printf("dev1 : %p\n", (void*)a);
}

__device__ void dev2()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;
    if (threadIdx.x == 0)
        printf("dev2 : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];
    extern __shared__ int b[];
    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;
    dev1();
    __syncthreads();
    dev2();
    __syncthreads();
    res[threadIdx.x] = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];
    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;
    cudaMalloc(&dres, 32 * sizeof(int));
    cudaMalloc(&dres2, 32 * sizeof(int));
    kernel<<<1, 32, 32 * sizeof(float)>>>(dres, dres2);
    int hres[32];
    int hres2[32];
    cudaMemcpy(hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < 32; ++k)
    {
        printf("%d -- %d\n", hres[k], hres2[k]);
    }
    return 0;
}
This code outputs the ptxas info of 384 bytes smem: one array for the global a, a second for the dev1 method's a, and a third for the dev2 method's a, totalling 3*32*sizeof(float) = 384 bytes. When running the kernel with dynamic shared memory equal to 32*sizeof(float), the pointer to b starts right after these three arrays.

EDIT:

The ptx file generated by this code holds the declarations of the statically-defined shared memory,

.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];

except for the entry point, where it is defined in the body of the method:

// _ZZ6kernelPiS_E1a has been demoted

The shared space of the memory is defined in the PTX documentation here:

    The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.

Though with no detail on the runtime. There is a word on it in the programming guide here, with no further detail on the mixing of the two.

During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use SR_LMEMHIOFF:

IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;

and calling the functions in reverse order assigns different values to the statically-allocated shared memory (which looks very much like a form of stackalloc). I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when all methods may be called (when one of the methods is not used and function pointers are used, the b address does not change, and the unallocated shared memory region is never accessed).

Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.
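As a side note, the static part can also be inspected at runtime: cudaFuncGetAttributes reports a kernel's statically-allocated shared memory in its sharedSizeBytes field. A host-side sketch (kernel, dres, dres2 are the names from the example above; the reported value should match the 384 bytes shown by ptxas):

// Query how much shared memory was statically allocated for `kernel`,
// then launch with the dynamic amount on top of it.
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, kernel);
printf("static smem : %zu bytes\n", attr.sharedSizeBytes);

size_t dynamicBytes = 32 * sizeof(int);
kernel<<<1, 32, dynamicBytes>>>(dres, dres2);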