How to use coalesced memory access


Problem description

I have N threads executing simultaneously on the device, and together they need M*N floats from global memory. What is the correct way to access global memory so that the accesses are coalesced? And how can shared memory help with this?

Recommended answer

Usually, good coalesced access is achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:

  • arr[tid] --- gives perfect coalescing
  • arr[tid+5] --- is almost perfect, probably misaligned
  • arr[tid*4] --- is not that good anymore, because of the gaps
  • arr[random(0..N)] --- horrible!
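
As a rough illustration of the first and third patterns, here is a minimal CUDA sketch; the kernel names, the out array, and the size parameter n are invented for the example:

    // Minimal sketch contrasting access patterns (names are illustrative only).
    __global__ void coalescedRead(const float *arr, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = arr[tid];        // neighbouring threads read neighbouring addresses: coalesced
    }

    __global__ void stridedRead(const float *arr, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid * 4 < n)
            out[tid] = arr[tid * 4];    // gaps between the addresses of neighbouring threads: poor coalescing
    }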

I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in plain CPU programming, although the impact is not as big there.

"But I have too many arrays, and each array is 2 to 3 times longer than my number of threads, so using a pattern like 'arr[tid*4]' is unavoidable. What could be a cure for that?"

If the offset is a multiple of some higher power of 2 (e.g. 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for loop, you can do something like this:

for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);

(The above assumes that the array size is a multiple of the number of threads.)

So, if the number of threads is a multiple of 32, the memory access will be good.
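
Put into a complete kernel, the loop above would look roughly like this (a minimal sketch; the kernel name and the doubling operation standing in for process() are invented for the example, and arraySize is assumed to be a multiple of the total thread count):

    // Minimal sketch: the loop above inside a CUDA kernel.
    // Assumes arraySize is a multiple of the total number of threads.
    __global__ void processAll(float *arr, size_t arraySize)
    {
        size_t threadIndex     = blockIdx.x * blockDim.x + threadIdx.x;
        size_t numberOfThreads = (size_t)gridDim.x * blockDim.x;

        for (size_t base = 0; base < arraySize; base += numberOfThreads)
            arr[base + threadIndex] *= 2.0f;   // stand-in for process(); neighbouring threads touch neighbouring floats
    }

Launched with a block size that is a multiple of 32 (e.g. processAll<<<blocks, 256>>>(devArr, arraySize)), each iteration of the loop reads a contiguous, aligned chunk of the array.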

Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environments you might need fewer or more threads for perfect memory access coalescing, but similar rules should apply.

与"warp"大小相关的"32"与并行访问全局内存有关吗?

Although not directly, there is some connection. Global memory is divided into segments of 32, 64, and 128 bytes which are accessed by half-warps. The more segments you touch for a given memory-fetch instruction, the longer it takes. You can read more details in the CUDA Programming Guide; there is a whole chapter on this topic: "5.3. Maximise Memory Throughput".

"In addition, I have heard a little about using shared memory to localize memory access. Is this preferred for memory coalescing, or does it have its own difficulties?"

Shared memory is much faster because it lies on-chip, but its size is limited. The memory is not segmented like global memory; you can access it almost randomly with no penalty. However, there are memory bank lines 4 bytes wide (the size of a 32-bit int). The memory address that each thread accesses should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the former touches only banks 0, 4, 8, 12 and the latter touches 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
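
As a rough sketch of that bank behaviour, assuming the 16 banks of 4-byte words described above (the kernel name, array sizes, and the block size of 32 are invented for the example):

    // Minimal sketch of conflict-free vs. conflicting shared memory access.
    // Assumes 16 banks of 4-byte words and a block of 32 threads.
    __global__ void bankAccessDemo(int *out)
    {
        __shared__ int smem[16 * 32];
        int tid = threadIdx.x;

        smem[tid] = tid;                   // one word per thread, one bank per thread: no conflict
        __syncthreads();

        int conflicting = smem[tid * 4];   // each half-warp hits only banks 0, 4, 8, 12: 4-way conflict
        int spread      = smem[tid * 5];   // each half-warp spreads over all 16 banks: no conflict
        out[tid] = conflicting + spread;
    }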

Again, you can read more in the CUDA Programming Guide.
