Efficiently Initializing Shared Memory Array in CUDA


Problem Description



Note that this shared memory array is never written to, only read from.

As I have it, my shared memory gets initialized like this:

__shared__ float TMshared[2592];
for (int i = 0; i < 2592; i++)
{
    TMshared[i] = TM[i];
}
__syncthreads();

(TM is passed to all threads as a kernel launch parameter.)

You might have noticed that this is highly inefficient: there is no parallelization going on, and every thread within the block writes to the same locations.

Can someone please recommend a more efficient approach, or comment on whether this really needs optimizing, given that the shared array in question is relatively small?

Thanks!

Solution

Use all threads to write independent locations; it will probably be quicker.

This example assumes a 1D threadblock/grid:

#define SSIZE 2592

__shared__ float TMshared[SSIZE];

int lidx = threadIdx.x;
while (lidx < SSIZE) {
    TMshared[lidx] = TM[lidx];
    lidx += blockDim.x;
}

__syncthreads();

