CUDA Compute Capability 2.0. Global memory access pattern

Problem description

Starting with CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. My question is: how? I would be thankful for some detailed information on how L2 works and how I should organize and access global memory if I need, for example, an array of 100-200 elements per thread.

Recommended answer

The L2 cache helps in some ways, but it does not obviate the need for coalesced access to global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.

In practice this is often not difficult to accomplish. For example:

int idx = threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];

will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% use of the available memory bandwidth.
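
For contrast, here is a sketch of an access pattern that defeats coalescing; the STRIDE constant and kernel name are only illustrative and are not part of the original answer:

// Hypothetical uncoalesced pattern: adjacent threads read locations that are
// STRIDE ints apart, so the 32 threads of a warp touch many different
// 128-byte blocks and most of each fetched block goes unused.
#define STRIDE 32
__global__ void strided_read(const int *global_array, int *out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = global_array[idx * STRIDE];
}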

A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:

int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];

would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger a 128-byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1) rather than by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of the data, for example using structures of arrays rather than arrays of structures, as sketched below.
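
A minimal sketch of that reorganization (the Particle/Particles structs and kernel names are only illustrative):

// Array of Structures (AoS): threads that read only .x touch every third
// float, so a warp's reads are spread over several 128-byte blocks.
struct Particle { float x, y, z; };
__global__ void read_aos(const Particle *p, float *out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = p[idx].x;   // stride of sizeof(Particle) bytes between threads
}

// Structure of Arrays (SoA): all x values are contiguous, so a warp's reads
// of x[idx] fall in adjacent locations and coalesce fully.
struct Particles { float *x, *y, *z; };
__global__ void read_soa(Particles p, float *out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = p.x[idx];   // adjacent threads read adjacent floats
}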

In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.

Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16KB L1 (and 48KB shared memory) or 48KB L1 (and 16KB shared memory). L2 is unified across the device and is 768KB.
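
That split can be requested through the CUDA runtime API; a brief sketch, where my_kernel stands in for whatever kernel you want to configure:

// Ask for the 48KB L1 / 16KB shared memory split for the whole device...
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

// ...or request the opposite split (16KB L1 / 48KB shared) for one kernel.
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);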

Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU-oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then, for the inevitable cases where we can't make perfect memory accesses, we let the caches provide their benefit.
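
Applied to the question's 100-200 elements per thread, one common way to keep accesses coalesced is to interleave the per-thread arrays so that element i of thread t lives at index i * num_threads + t; each loop iteration then has the warp reading adjacent locations. A sketch under that assumption (ELEMS and the names are illustrative):

#define ELEMS 128   // illustrative: roughly 100-200 elements per thread

// global_array holds ELEMS * num_threads ints, interleaved so that element i
// of thread t sits at global_array[i * num_threads + t].  At each iteration
// of the loop, the 32 threads of a warp read 32 adjacent ints - a coalesced
// request served from one or two 128-byte blocks.
__global__ void per_thread_arrays(const int *global_array, int *out, int num_threads)
{
    int t = threadIdx.x + (blockDim.x * blockIdx.x);
    int sum = 0;
    for (int i = 0; i < ELEMS; ++i)
        sum += global_array[i * num_threads + t];
    out[t] = sum;
}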

You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.
