CUDA: bad performance with shared memory and no parallelism


Problem description

I'm trying to exploit shared memory in this kernel function, but the performance is not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking of exploiting shared memory to avoid the memory latency. But apparently something is wrong, because my application has become really slow since I started using shared memory.
This is the kernel:

__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
    int j = 0;

    // index GPU - Transaction-wise
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;

    // shared variable
    extern __shared__ int shared_memory_data[];
    extern __shared__ int shared_b1_data[];
    extern __shared__ int shared_b2_data[];

    // copy from global memory into shared memory and sync threads
    shared_b1_data[tid] = b1_memory[tid];
    shared_b2_data[tid] = b2_memory[tid];
    __syncthreads();

    // AND each int bitwise
    for(j = 0; j < b1_size; j++)
        shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);

    // write result for this block to global memory
    _memory_device[i] = shared_memory_data[i];
}

The shared variables are declared extern because I don't know the size of b1 and b2: they depend on the number of customers, which I only know at runtime (but both always have the same size).
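
(Note: with extern __shared__, the buffer size has to be supplied as the third launch-configuration argument, and all extern __shared__ declarations alias one and the same buffer, so separate regions must be carved out by hand. A minimal sketch of sizing and partitioning it, with a hypothetical kernel name and a per-block element count n:)

__global__ void partitionedSmemSketch(int* out, const int* in1, const int* in2, int n)
{
    // one dynamic buffer; every extern __shared__ array starts at its base
    extern __shared__ int smem[];
    int* s_in1 = smem;        // first n ints
    int* s_in2 = smem + n;    // next n ints

    const int tid = threadIdx.x;
    if (tid < n) {
        s_in1[tid] = in1[tid];
        s_in2[tid] = in2[tid];
    }
    __syncthreads();

    if (tid < n)
        out[tid] = s_in1[tid] & s_in2[tid];
}

// the third <<< >>> argument is the dynamic shared memory size in bytes:
// partitionedSmemSketch<<<1, 256, 2 * n * sizeof(int)>>>(d_out, d_in1, d_in2, n);
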
This is how I call the kernel:

void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
    int* _memory_device;
    int* b1_memory;
    int* b2_memory;

    int b1_size = b1.getIntSize();

    // allocate memory on GPU
    (cudaMalloc((void **)&b1_memory,  _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&b2_memory,  _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&_memory_device,  _memSizeInt * SIZE_UINT));

    // copy values on GPU
    (cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
    (cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
    (cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));

    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);

    AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);

    // return values
    (cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));

    // Free Memory
    (cudaFree(b1_memory));
    (cudaFree(b2_memory));
    (cudaFree(_memory_device));
}

b1 and b2 are bitmaps with 4 bits for each element. The number of elements depends on the number of customers. I also have a problem with the kernel's launch parameters: if I add more blocks or threads, AndBitwiseOperation() no longer gives me the correct result. With just 1 block and 1 thread per block the result is correct, but then the kernel is not parallel.
Any advice is welcome :)
Thank you

Answer

I did not really understand what your kernel is supposed to do.

You should read more about CUDA and GPU programming.

I have tried to point out some of the mistakes:

  1. Shared memory (sm) should reduce global memory reads. Analyze your global memory (gm) read and write operations per thread:

a. You read global memory twice and write sm twice
b. (ignoring the pointless loop, which never uses its counter) you read sm twice and write sm once
c. you read sm once and write gm once

So in total you have gained nothing; you could just as well use global memory directly.
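
For contrast, shared memory pays off when threads reuse the data that was loaded. A minimal sketch (a hypothetical kernel, not part of the original answer) in which each input element is read by up to three threads, so staging it once in sm saves up to two gm reads:

__global__ void sum3Neighbors(int* out_g, const int* in_g, int n)
{
    extern __shared__ int tile_s[];               // blockDim.x + 2 ints
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    const int lid = threadIdx.x + 1;              // +1 leaves room for the left halo

    if (gid < n)
        tile_s[lid] = in_g[gid];                  // one gm read per element ...
    if (threadIdx.x == 0)                         // ... plus halo loads at the edges
        tile_s[0] = (gid > 0) ? in_g[gid - 1] : 0;
    if (threadIdx.x == blockDim.x - 1 || gid == n - 1)
        tile_s[lid + 1] = (gid + 1 < n) ? in_g[gid + 1] : 0;
    __syncthreads();

    if (gid < n)                                  // three sm reads instead of three gm reads
        out_g[gid] = tile_s[lid - 1] + tile_s[lid] + tile_s[lid + 1];
}

// launched with dynamic shared memory for blockDim.x + 2 ints:
// sum3Neighbors<<<blocks, threads, (threads + 2) * sizeof(int)>>>(d_out, d_in, n);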

You use all threads to write out one value at the block index "i". You should use only one thread to write this data out;
it makes no sense to output the same data from multiple threads, whose writes only get serialized. A sketch of the one-writer pattern follows below.
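
A minimal sketch of that pattern (the names here are hypothetical):

__global__ void oneWriterPerBlock(int* out_g, const int* in_g)
{
    extern __shared__ int data_s[];
    const unsigned int tid = threadIdx.x;

    // every thread stages one element into shared memory
    data_s[tid] = in_g[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // but only a single thread per block writes the (single) result out
    if (tid == 0)
        out_g[blockIdx.x] = data_s[0];
}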

You use a loop and don't use the loop counter at all.

您在"tid"处写作,在"i"处随机阅读.

You write at "tid" and read at "i" randomly.

This assignment is overhead:

unsigned int tid = threadIdx.x;

  • The results cannot be correct with more than one block, since within a single block tid = i!
    All the wrong indexing leads to wrong calculations as soon as more than one block is used.

    The shared memory at "i" was never written!

    _memory_device[i] = shared_memory_data[i];
    

  • My assumption of what your kernel should do:

    /*
     * Call kernel with x-block usage and up to 3D Grid
     */
    __global__ void bitwiseAnd(int* outData_g, 
        const long long int inSize_s, 
        const int* inData1_g, 
        const int* inData2_g)
    {
        //get unique block index
        const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D
    
        //get unique thread index
        const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x; 
    
        //check global unique thread range
        if(threadId >= inSize_s)
            return;
    
        //output bitwise and
        outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
    }
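
    A possible way to call it from Bitmap::And, reusing the buffers above (a sketch; the block size of 256 is an arbitrary choice):

    // one thread per int element
    const long long int n = b1_size;
    const int threadsPerBlock = 256;
    const int numBlocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);

    bitwiseAnd<<<numBlocks, threadsPerBlock>>>(_memory_device, n, b1_memory, b2_memory);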
    
