Which memory access pattern is more efficient for a cached GPU?


Question

So let's say I have a global array of memory:

|a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

   0 -> 1 -> 2 -> 3
t1 a -> b -> c -> .
t2 e -> f -> g -> .
t3 i -> j -> k -> .
t4 .    .    . `> .

The above pattern splits the array into blocks, with each thread iterating through its block and accessing the block's next element in each time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops using this pattern can easily be unrolled by the compiler. A rough kernel sketch is shown below.
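
As a minimal OpenCL C sketch of this blocked layout (the kernel name, the chunk parameter, and the summation it performs are illustrative assumptions, not taken from the question), each work-item walks its own contiguous block of the array:

// Illustrative kernel: each work-item sums its own contiguous block of 'chunk' elements.
// Consecutive reads by the SAME work-item stay within one cache line, but at any given
// time slice neighbouring work-items touch addresses 'chunk' elements apart.
__kernel void sum_blocked(__global const float *in,
                          __global float *out,
                          const int chunk)
{
    const int gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < chunk; ++i)
        acc += in[gid * chunk + i];   // a, b, c, ... for work-item 0; e, f, g, ... for work-item 1
    out[gid] = acc;
}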

The second pattern:

   0 -> 1 -> 2 -> 3
t1 a -> e -> i -> .
t2 b -> f -> j -> .
t3 c -> g -> k -> .
t4 .    .    . `> .

The above pattern accesses memory in strides: for example, thread 1 accesses a, then e, then i, and so on. This maximizes cache locality per unit time. Consider 64 work-items 'striding' together at any given time slice. With a 64-byte cache line and elements of sizeof(float), the reads of work-items 1-16 are all served by the cache line fetched for work-item 1's read. The data width/count per cell (where 'a' above is one cell) has to be chosen carefully to avoid misaligned accesses. These loops don't seem to unroll as easily (or at all, using Intel's Kernel Builder targeting the CPU). I believe this pattern would work well on a GPU; see the sketch below.
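
A comparable sketch of the strided pattern (again with illustrative names; num_items is simply the total number of work-items) swaps the roles of the loop index and the work-item id, so that at each time slice neighbouring work-items read neighbouring addresses:

// Illustrative kernel: at time slice t every work-item reads in[t * num_items + gid],
// so the work-items touch consecutive addresses at the same moment and a 64-byte
// cache line fetched for one work-item serves the next 15 (for 4-byte floats).
__kernel void sum_strided(__global const float *in,
                          __global float *out,
                          const int chunk)
{
    const int gid = get_global_id(0);
    const int num_items = get_global_size(0);
    float acc = 0.0f;
    for (int t = 0; t < chunk; ++t)
        acc += in[t * num_items + gid];   // a, e, i, ... for work-item 0; b, f, j, ... for work-item 1
    out[gid] = acc;
}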

I'm targeting GPUs with cache hierarchies, specifically AMD's latest architecture (GCN). Is the second access pattern an example of 'coalescing'? Am I going wrong somewhere in my thought process?

Answer

I think the answer depends on whether the accesses are to global or local memory. If you are pulling the data from global memory, then you need to worry about coalescing the reads (i.e., contiguous blocks, as in the second example). However, if you are pulling the data from local memory, then you need to worry about bank conflicts. I have some, but not a lot of, experience here, so I'm not stating this as absolute truth.
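
To make the global/local distinction concrete, here is a hypothetical staging sketch (the kernel name and the use of one float per work-item are assumptions for illustration): consecutive work-items read consecutive global addresses, which coalesces, and each writes one element to consecutive local addresses, which maps the accesses to different banks and so avoids conflicts:

// Illustrative sketch: coalesced global read staged through local memory.
// 'tile' is expected to hold one float per work-item in the work-group.
__kernel void stage_to_local(__global const float *in,
                             __global float *out,
                             __local float *tile)
{
    const int lid = get_local_id(0);
    const int gid = get_global_id(0);

    tile[lid] = in[gid];              // coalesced global read, conflict-free local write
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the whole work-group

    // ... work on tile[] here; indexing local memory with a stride that is a
    // multiple of the bank count would serialize the accesses instead.

    out[gid] = tile[lid];
}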

After reading up on GCN, I don't think the caches make a difference here. You can basically think of them as just speeding up global memory if you repeatedly read/write the same elements. On a side note, thanks for asking the question, because reading up on the new architecture is pretty interesting.

Edit 2: Here's a nice Stack Overflow discussion of banks for local and global memory: Why aren't there bank conflicts in global memory for Cuda/OpenCL?
