global vs shared memory in CUDA


Question

I have two CUDA kernels that compute similar things. One uses global memory (myfun is a device function that reads a lot from global memory and does the computation). The second kernel transfers that chunk of data from global memory to shared memory so the data can be shared among the different threads of a block. My kernel that uses global memory is much faster than the one with shared memory. What are the possible reasons?

loadArray just copies a small part of d_x into m.
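loadArray itself isn't shown in the question. A minimal sketch consistent with the description (the element count n is an assumption; the asker's real version may obtain the size differently) would be a plain serial copy:

```cuda
// Hypothetical reconstruction of loadArray (not shown in the question):
// a straight element-by-element copy. In the shared-memory kernel below
// it is executed by a single thread, so the copy is entirely serial.
__device__ void loadArray(float *m, const float *d_x, int n)
{
    for (int i = 0; i < n; i++)
        m[i] = d_x[i];   // one thread, one element at a time
}
```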

__global__ void mykernel(float *d_x, float *d_y, int *d_z, float *d_u, int N, int K, int D)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  int index = 0;
  float max_s = -1e+37F;  // start from a very small value so the max search can update

  if (tid < N)
    {
      for (int i = 0; i < K; i++)
        {
          float s = myfun(&d_x[i*D], d_y, tid);
          if (s > max_s)
            {
              max_s = s;
              index = i;
            }
        }
      d_z[tid] = index;
      d_u[tid] = max_s;
    }
}

Using shared memory:

__global__ void mykernel(float *d_x, float *d_y, int *d_z, float *d_u, int N, int K)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  int index = 0;
  float max_s = -1e+37F;  // start from a very small value so the max search can update

  extern __shared__ float m[];
  if( threadIdx.x == 0 )
    loadArray( m, d_x );
  __syncthreads();

  if (tid < N)
    {
      for (int i = 0; i < K; i++)
        {
          float s = myfun(m, d_y, tid);
          if (s > max_s)
            {
              max_s = s;
              index = i;
            }
        }
      d_z[tid] = index;
      d_u[tid] = max_s;
    }
}


Answer

The problem is that only the first thread in each block is reading from global memory into shared memory. This is much slower than letting all threads read from global memory simultaneously.
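A common fix is to let every thread of the block copy a strided portion of the chunk, so that on each pass consecutive threads touch consecutive addresses and the loads coalesce. A sketch (the function name and the explicit size parameter n are assumptions, not the asker's code):

```cuda
// Cooperative version of the load: all threads in the block participate.
// In each loop iteration, thread t reads element i = t, t + blockDim.x, ...
// so neighbouring threads access neighbouring addresses (coalesced).
__device__ void loadArrayCooperative(float *m, const float *d_x, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        m[i] = d_x[i];
}

// In the kernel, replace the single-threaded load with:
//   loadArrayCooperative(m, d_x, n);
//   __syncthreads();   // still required before any thread uses m[]
```

The `__syncthreads()` barrier is still needed, because no thread may read the shared array before every thread has finished its share of the copy.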

Using shared memory is an advantage when each thread needs to access neighbouring elements from global memory - but that doesn't appear to be the case here.
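For contrast, here is the kind of access pattern where shared memory does pay off: a generic 1D three-point stencil (unrelated to the asker's myfun; names and the averaging operation are illustrative). Each thread reads its left and right neighbours, so without shared memory every global element would be fetched up to three times; with a shared tile it is fetched once per block:

```cuda
// 1D stencil: out[i] = average of in[i-1], in[i], in[i+1].
// Each block stages its slice of `in` (plus a one-element halo on each
// side) into shared memory, then every thread reads only from the tile.
__global__ void stencil1d(const float *in, float *out, int n)
{
    extern __shared__ float tile[];           // blockDim.x + 2 floats
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                // +1 leaves room for the left halo

    if (gid < n)
        tile[lid] = in[gid];                  // coalesced main load
    if (threadIdx.x == 0)                     // left halo cell
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)        // right halo cell
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Here neighbouring threads genuinely share data, which is exactly the situation the answer describes; in the asker's kernel every thread reads the same chunk independently through myfun, so the L1/L2 caches already serve repeated reads and the single-threaded staging only adds a serial bottleneck.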

