When is CUDA's __shared__ memory useful?


Question

Can someone please help me with a very simple example on how to use shared memory? The example included in the CUDA C Programming Guide seems cluttered by irrelevant details.

For example, if I copy a large array to device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?

Answer

In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to help, you must reuse the data transferred into shared memory several times, with good access patterns. The reason is simple: reading directly from global memory costs 1 global memory read and 0 shared memory reads; staging it in shared memory first costs 1 global memory read plus 1 shared memory read, which takes longer.
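To make that concrete, the element-wise squaring case needs no shared memory at all; a plain global-memory kernel suffices. This is a minimal sketch (the kernel name and the `n` bounds-check parameter are illustrative, not from the original answer):

```cuda
// Each thread squares one element directly in global memory.
// No data reuse occurs, so shared memory would only add overhead.
__global__ void square_it(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                      // guard against out-of-range threads
        data[idx] = data[idx] * data[idx];
}
```

Each element is read once and written once, which is exactly the single-use pattern where shared memory cannot pay for itself.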

Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of its left and right neighbors, squared:

  __global__ void compute_it(float *data)
  {
     int tid = threadIdx.x;
     __shared__ float myblock[1024];
     float tmp;

     // load the thread's data element into shared memory
     myblock[tid] = data[tid];

     // ensure that all threads have loaded their values into
     // shared memory; otherwise, one thread might be computing
     // on uninitialized data.
     __syncthreads();

     // compute the average of this thread's left and right neighbors
     tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
     // square the previous result and add my value, squared
     tmp = tmp*tmp + myblock[tid] * myblock[tid];

     // write the result back to global memory
     data[tid] = tmp;
  }

Note that this is envisioned to work using only one block; the extension to more blocks should be straightforward. It assumes a block dimension of (1024, 1, 1) and a grid dimension of (1, 1, 1).
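One way that extension could look (a sketch, not part of the original answer): give each block its own 1024-element tile of the array and keep the neighbor wrap-around inside the tile, so no inter-block communication is needed. The kernel name and tile layout below are assumptions:

```cuda
// Hypothetical multi-block variant: each block stages its own
// 1024-element tile into shared memory and wraps neighbor indices
// within the tile, so blocks stay independent of one another.
__global__ void compute_it_multi(float *data)
{
    int tid = threadIdx.x;
    float *tile = data + (size_t)blockIdx.x * blockDim.x;  // this block's tile
    __shared__ float myblock[1024];

    myblock[tid] = tile[tid];
    __syncthreads();   // all of the tile must be loaded before any reads

    int left  = (tid > 0) ? tid - 1 : blockDim.x - 1;
    int right = (tid < blockDim.x - 1) ? tid + 1 : 0;
    float tmp = (myblock[left] + myblock[right]) * 0.5f;
    tile[tid] = tmp * tmp + myblock[tid] * myblock[tid];
}
```

A launch such as `compute_it_multi<<<num_blocks, 1024>>>(d_data)` would then cover an array of `num_blocks * 1024` elements; note that this tiling changes the wrap-around semantics at tile boundaries compared to the single-block version.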

