When is CUDA's __shared__ memory useful?


Question


Can someone please help me with a very simple example on how to use shared memory? The example included in the Cuda C programming guide seems cluttered by irrelevant details.


For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?

Answer


In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to be useful, you must use data transferred to shared memory several times, using good access patterns, to have it help. The reason for this is simple: just reading from global memory requires 1 global memory read and zero shared memory reads; reading it into shared memory first would require 1 global memory read and 1 shared memory read, which takes longer.
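To make the point concrete, here is a minimal sketch of the element-wise square case described above; the kernel name `square_it` and the `n` parameter are illustrative. Each element is read once and written once, so reading straight from global memory is already optimal:

```cuda
// Each thread reads its element from global memory once, squares it,
// and writes it back. Staging the value through __shared__ first would
// only add an extra shared-memory round trip with no reuse to pay for it.
__global__ void square_it(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i];
}
```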


Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:

  __global__ void compute_it(float *data)
  {
     int tid = threadIdx.x;
     __shared__ float myblock[1024];
     float tmp;

     // load the thread's data element into shared memory
     myblock[tid] = data[tid];

     // ensure that all threads have loaded their values into
     // shared memory; otherwise, one thread might be computing
     // on uninitialized data.
     __syncthreads();

     // compute the average of this thread's left and right neighbors
     tmp = (myblock[tid>0?tid-1:1023] + myblock[tid<1023?tid+1:0]) * 0.5f;
     // square the previous result and add my value, squared
     tmp = tmp*tmp + myblock[tid]*myblock[tid];

     // write the result back to global memory
     data[tid] = tmp;
  }


Note that this is envisioned to work using only one block. The extension to more blocks should be straightforward. Assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).
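A minimal host-side launch matching those dimensions might look like the sketch below (error checking omitted for brevity; the array size of 1024 is chosen to match the kernel's shared-memory buffer and single-block assumption):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    float h_data[N];
    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    // allocate device memory and copy the input up
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // one block of 1024 threads, i.e. grid (1,1,1) and block (1024,1,1)
    compute_it<<<1, N>>>(d_data);

    // copy the results back and release the device buffer
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}
```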

