Cuda Shared Memory array variable


Question

I am trying to declare a variable for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I am trying to make it so the user can input the size of the matrix to calculate, but that would mean changing BLOCK_SIZE. I changed it, and now I get a compiler error: "error: constant value is not known". I've looked into it and it's similar to this thread. So I tried:

__shared__ int buf [];

But then I get: "error: incomplete type is not allowed".

Thanks, Dan.

Update with code (pretty much followed this guide and the Starting Out with CUDA guide): the block size is passed in by asking the user for the size of the matrix. They enter x and y. Block size is only x, and right now it has to accept the same value for x and y.

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;

    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;

    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;

    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A

        extern __shared__ float As[];

        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {

            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

Answer

extern __shared__ int buf[];

When you launch the kernel, you should launch it this way:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);

If you have multiple extern declarations of shared memory:

extern __shared__ float As[];

extern __shared__ float Bs[];

this will lead to As pointing to the same address as Bs.

You will need to keep As and Bs inside the one 1D array:

extern __shared__ float smem[];

When calling the kernel, you should launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of dynamic shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x], and when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
