How to use coalesced memory access
Problem description
I have N threads executing simultaneously on the device, and they need M*N floats from global memory. What is the correct way to access global memory so that the accesses are coalesced? And how can shared memory help with this?
Recommended answer
Usually, good coalesced access can be achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:
- arr[tid] --- gives perfect coalescing
- arr[tid+5] --- is almost perfect, probably misaligned
- arr[tid*4] --- is not that good anymore, because of the gaps
- arr[random(0..N)] --- is horrible!
I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in plain CPU programming, although the impact is not as big there.
"But I have so many arrays, each of which is 2 or 3 times longer than my number of threads, so using a pattern like "arr[tid*4]" is inevitable. What might be a cure for this?"
If the offset is a multiple of some higher power of two (e.g. 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for loop, you can do something like this:
for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);
(the above assumes that the array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, the memory access will be good.
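Put into a full kernel, the loop above might look like the following sketch (the kernel name and the +1.0f operation are illustrative placeholders):

```cuda
// Each thread strides through the array in steps of the total thread count,
// so on every iteration the grid reads one contiguous, aligned chunk.
__global__ void processLong(float *arr, size_t arraySize)
{
    size_t threadIndex     = blockIdx.x * blockDim.x + threadIdx.x;
    size_t numberOfThreads = (size_t)gridDim.x * blockDim.x;

    // Assumes arraySize is a multiple of numberOfThreads, as in the text.
    for (size_t base = 0; base < arraySize; base += numberOfThreads)
        arr[base + threadIndex] += 1.0f;   // coalesced on every iteration
}
```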
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environments you might need fewer or more threads for perfect memory access coalescing, but similar rules should apply.
"Is the '32' related to the warp size connected with accessing global memory in parallel?"
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes, which are accessed by half-warps. The more segments you touch for a given memory-fetch instruction, the longer it takes. You can read more details in the CUDA Programming Guide; there is a whole chapter on this topic: "5.3. Maximize Memory Throughput".
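To see why strided patterns cost more fetches, a small host-side helper (hypothetical, assuming 128-byte segments, 4-byte floats, and a 16-thread half-warp) can count how many segments one half-warp touches for a given element stride:

```cpp
#include <set>

// Counts the distinct 128-byte segments touched by a half-warp of 16
// threads reading 4-byte floats at the given element stride.
int segmentsTouched(int stride)
{
    std::set<int> segments;
    for (int tid = 0; tid < 16; ++tid)
        segments.insert(tid * stride * 4 / 128);  // byte address / segment size
    return (int)segments.size();
}
```

With stride 1 the half-warp fits in a single segment; with stride 4 it spills into two, and with stride 8 into four, so the fetch takes correspondingly longer.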
"In addition, I heard a little about using shared memory to localize the memory accesses. Is this preferred for memory coalescing, or does it have difficulties of its own?"

Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global memory; you can access it almost randomly at no penalty. However, there are memory bank lines of width 4 bytes (the size of a 32-bit int). The memory address that each thread accesses should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the first one accesses only banks 0, 4, 8, 12, while the latter accesses 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
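A common workaround, sketched below under the 16-bank assumption used in the text (the kernel name and TILE size are illustrative; the input is assumed square with width a multiple of TILE), is to pad each row of a shared-memory tile by one word so that column accesses no longer land in the same bank:

```cuda
#define TILE 16

// Transposes one TILE x TILE block through shared memory. The +1 padding
// shifts each row by one bank, so reading a column hits 16 different
// banks instead of hammering a single one (the classic bank-conflict fix).
__global__ void transposeTile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    // Swapped block indices: write the transposed tile, still coalesced.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```

Without the padding, the column read tile[threadIdx.x][threadIdx.y] would map every thread of a half-warp to the same bank.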
Again, you can read more in the CUDA Programming Guide.