How to use coalesced memory access
Problem description
I have N threads executing simultaneously on the device, and they need M*N floats from global memory. What is the correct way to access global memory so that the accesses are coalesced? And how can shared memory help with this?
Recommended answer
Usually, good coalesced access can be achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:
- arr[tid] --- gives perfect coalescing
- arr[tid+5] --- is almost perfect, probably misaligned
- arr[tid*4] --- is not that good anymore, because of the gaps
- arr[random(0..N)] --- is horrible!
I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in plain CPU programming, although the impact is not as big there.
"But I have so many arrays, each of which is 2 or 3 times longer than my number of threads, so using a pattern like "arr[tid*4]" is inevitable. What might be a cure for this?"
If the offset is a multiple of some higher power of two (e.g. 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for loop, you can do something like this:
for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);
(the above assumes that the array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, the memory access will be good.
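Put into a full kernel, the loop above might look like the following sketch (the kernel name and the +1.0f operation are illustrative placeholders):

```cuda
// Each thread strides through the array in steps of the total thread count,
// so on every iteration the grid reads one contiguous, aligned chunk.
__global__ void processLong(float *arr, size_t arraySize)
{
    size_t threadIndex     = blockIdx.x * blockDim.x + threadIdx.x;
    size_t numberOfThreads = (size_t)gridDim.x * blockDim.x;

    // Assumes arraySize is a multiple of numberOfThreads, as in the text.
    for (size_t base = 0; base < arraySize; base += numberOfThreads)
        arr[base + threadIndex] += 1.0f;   // coalesced on every iteration
}
```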
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environments you might need fewer or more threads for perfect memory access coalescing, but similar rules should apply.
"Is the '32' related to the warp size connected with accessing global memory in parallel?"
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes, which are accessed by half-warps. The more segments you touch for a given memory-fetch instruction, the longer it takes. You can read more details in the CUDA Programming Guide; there is a whole chapter on this topic: "5.3. Maximize Memory Throughput".
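To see why strided patterns cost more fetches, a small host-side helper (hypothetical, assuming 128-byte segments, 4-byte floats, and a 16-thread half-warp) can count how many segments one half-warp touches for a given element stride:

```cpp
#include <set>

// Counts the distinct 128-byte segments touched by a half-warp of 16
// threads reading 4-byte floats at the given element stride.
int segmentsTouched(int stride)
{
    std::set<int> segments;
    for (int tid = 0; tid < 16; ++tid)
        segments.insert(tid * stride * 4 / 128);  // byte address / segment size
    return (int)segments.size();
}
```

With stride 1 the half-warp fits in a single segment; with stride 4 it spills into two, and with stride 8 into four, so the fetch takes correspondingly longer.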
"In addition, I heard a little about using shared memory to localize the memory accesses. Is this preferred for memory coalescing, or does it have difficulties of its own?"

Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global memory; you can access it almost randomly at no penalty. However, there are memory bank lines of width 4 bytes (the size of a 32-bit int). The memory address that each thread accesses should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the first one accesses only banks 0, 4, 8, 12, while the latter accesses 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
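A common workaround, sketched below under the 16-bank assumption used in the text (the kernel name and TILE size are illustrative; the input is assumed square with width a multiple of TILE), is to pad each row of a shared-memory tile by one word so that column accesses no longer land in the same bank:

```cuda
#define TILE 16

// Transposes one TILE x TILE block through shared memory. The +1 padding
// shifts each row by one bank, so reading a column hits 16 different
// banks instead of hammering a single one (the classic bank-conflict fix).
__global__ void transposeTile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    // Swapped block indices: write the transposed tile, still coalesced.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```

Without the padding, the column read tile[threadIdx.x][threadIdx.y] would map every thread of a half-warp to the same bank.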
Again, you can read more in the CUDA Programming Guide.