Which memory access pattern is more efficient for a cached GPU?


Question

So let's say I have a global array of memory:

|a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

   0 -> 1 -> 2 -> 3
t1 a -> b -> c -> .
t2 e -> f -> g -> .
t3 i -> j -> k -> .
t4 .    .    . `> .

The above pattern splits the array into blocks, with each thread iterating through its block and accessing the block's next element in each time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops using this pattern can easily be unrolled by the compiler. A rough kernel sketch is shown below.
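
As a minimal OpenCL C sketch of this blocked layout (the kernel name, the chunk parameter, and the summation it performs are illustrative assumptions, not taken from the question), each work-item walks its own contiguous block of the array:

// Illustrative kernel: each work-item sums its own contiguous block of 'chunk' elements.
// Consecutive reads by the SAME work-item stay within one cache line, but at any given
// time slice neighbouring work-items touch addresses 'chunk' elements apart.
__kernel void sum_blocked(__global const float *in,
                          __global float *out,
                          const int chunk)
{
    const int gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < chunk; ++i)
        acc += in[gid * chunk + i];   // a, b, c, ... for work-item 0; e, f, g, ... for work-item 1
    out[gid] = acc;
}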

The second pattern:

   0 -> 1 -> 2 -> 3
t1 a -> e -> i -> .
t2 b -> f -> j -> .
t3 c -> g -> k -> .
t4 .    .    . `> .

The above pattern accesses memory in strides: for example, thread 1 accesses a, then e, then i, and so on. This maximizes cache locality per unit time. Consider 64 work-items 'striding' together at any given time slice. With a 64-byte cache line and elements of sizeof(float), the reads of work-items 1-16 are all served by the cache line fetched for work-item 1's read. The data width/count per cell (where 'a' above is one cell) has to be chosen carefully to avoid misaligned accesses. These loops don't seem to unroll as easily (or at all, using Intel's Kernel Builder targeting the CPU). I believe this pattern would work well on a GPU; see the sketch below.
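
A comparable sketch of the strided pattern (again with illustrative names; num_items is simply the total number of work-items) swaps the roles of the loop index and the work-item id, so that at each time slice neighbouring work-items read neighbouring addresses:

// Illustrative kernel: at time slice t every work-item reads in[t * num_items + gid],
// so the work-items touch consecutive addresses at the same moment and a 64-byte
// cache line fetched for one work-item serves the next 15 (for 4-byte floats).
__kernel void sum_strided(__global const float *in,
                          __global float *out,
                          const int chunk)
{
    const int gid = get_global_id(0);
    const int num_items = get_global_size(0);
    float acc = 0.0f;
    for (int t = 0; t < chunk; ++t)
        acc += in[t * num_items + gid];   // a, e, i, ... for work-item 0; b, f, j, ... for work-item 1
    out[gid] = acc;
}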

I'm targeting GPUs with cache hierarchies, specifically AMD's latest architecture (GCN). Is the second access pattern an example of 'coalescing'? Am I going wrong somewhere in my thought process?

Answer

I think the answer depends on whether the accesses are to global or local memory. If you are pulling the data from global memory, then you need to worry about coalescing the reads (i.e., contiguous blocks, as in the second example). However, if you are pulling the data from local memory, then you need to worry about bank conflicts. I have some, but not a lot of, experience here, so I'm not stating this as absolute truth.
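
To make the global/local distinction concrete, here is a hypothetical staging sketch (the kernel name and the use of one float per work-item are assumptions for illustration): consecutive work-items read consecutive global addresses, which coalesces, and each writes one element to consecutive local addresses, which maps the accesses to different banks and so avoids conflicts:

// Illustrative sketch: coalesced global read staged through local memory.
// 'tile' is expected to hold one float per work-item in the work-group.
__kernel void stage_to_local(__global const float *in,
                             __global float *out,
                             __local float *tile)
{
    const int lid = get_local_id(0);
    const int gid = get_global_id(0);

    tile[lid] = in[gid];              // coalesced global read, conflict-free local write
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the whole work-group

    // ... work on tile[] here; indexing local memory with a stride that is a
    // multiple of the bank count would serialize the accesses instead.

    out[gid] = tile[lid];
}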

After reading up on GCN, I don't think the caches make a difference here. You can basically think of them as just speeding up global memory if you repeatedly read/write the same elements. On a side note, thanks for asking the question, because reading up on the new architecture is pretty interesting.

Edit 2: Here's a nice Stack Overflow discussion of banks for local and global memory: Why aren't there bank conflicts in global memory for Cuda/OpenCL?
