CUDA coalesced memory

Question

What does "coalesced" mean in a CUDA global memory transaction? I couldn't understand it even after going through the CUDA guide. How do I achieve it? In the CUDA programming guide's matrix example, is accessing the matrix row by row called coalesced, or is accessing it column by column called coalesced? Which is correct, and why?

Recommended answer

The information below most likely applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.

The same logic can also be applied to shared memory to avoid bank conflicts.

A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. That is an oversimplification, but the way to get one is simply to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
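As a minimal sketch of that access pattern, the kernel below has each thread read the 4-byte element at its own global index, so threads 0, 1, 2, and 3 touch byte offsets 0x0, 0x4, 0x8, and 0xc. The names `element_index`, `copy_consecutive`, and `launch_copy` are illustrative and do not come from the CUDA guide; `d_in` and `d_out` are assumed to be device pointers to at least `n` floats.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: the element a given thread touches.
// Consecutive threads get consecutive indices.
__host__ __device__ inline int element_index(int block, int block_size, int thread)
{
    return block * block_size + thread;
}

// Each thread copies one float (4 bytes), so thread i reads byte offset 4 * i.
__global__ void copy_consecutive(const float* in, float* out, int n)
{
    int i = element_index(blockIdx.x, blockDim.x, threadIdx.x);
    if (i < n) {
        out[i] = in[i];
    }
}

// Launch wrapper. Assumes n > 0 and that d_in and d_out are device
// pointers to at least n floats.
cudaError_t launch_copy(const float* d_in, float* d_out, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    copy_consecutive<<<blocks, threads>>>(d_in, d_out, n);
    return cudaDeviceSynchronize();
}
```

The point of the sketch is only the index expression: because `i` grows by one from each thread to the next, neighbouring threads read neighbouring addresses.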

In a matrix example, keep in mind that your matrix has to reside linearly in memory. You can lay it out however you want, but your memory accesses should reflect that layout. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be stored row after row, like this, so that (r,c) maps to memory location (r*4 + c)
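The mapping above can be sketched as a small index helper. This is an illustration under stated assumptions, not code from the programming guide: `row_major_index` and `scale_matrix` are hypothetical names, and the kernel assumes a 2D launch in which `threadIdx.x` runs along the columns.

```cuda
#include <cuda_runtime.h>

// Row-major layout: element (r, c) of a matrix with num_cols columns
// lives at linear position r * num_cols + c.
__host__ __device__ inline int row_major_index(int r, int c, int num_cols)
{
    return r * num_cols + c;
}

// Illustrative kernel: multiply every element by `factor`.
// threadIdx.x is mapped to the column, so threads that are neighbours
// in x touch neighbouring addresses within a row.
__global__ void scale_matrix(float* m, int rows, int cols, float factor)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        m[row_major_index(r, c, cols)] *= factor;
    }
}
```

For the 3x4 matrix above, `row_major_index(1, 2, 4)` is 6 and `row_major_index(2, 3, 4)` is 11 (the element written as b).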

0 1 2 3 4 5 6 7 8 9 a b

Suppose you need to access each element once, and say you have four threads. Which thread will be used for which element? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see whether the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
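The two assignments can be written as two index formulas. The sketch below is one possible way to express them, assuming a single block whose thread count equals the number of threads in the example; `blocked_index`, `interleaved_index`, and the two kernels are hypothetical names, and summing the elements is just a stand-in for "access each element once".

```cuda
#include <cuda_runtime.h>

// First option: thread t owns a contiguous chunk, elements
// t * per_thread + k for k = 0 .. per_thread - 1.
__host__ __device__ inline int blocked_index(int t, int k, int per_thread)
{
    return t * per_thread + k;
}

// Second option: threads are interleaved, thread t takes elements
// t + k * num_threads for k = 0 .. per_thread - 1.
__host__ __device__ inline int interleaved_index(int t, int k, int num_threads)
{
    return t + k * num_threads;
}

// At step k, neighbouring threads read addresses per_thread elements apart.
__global__ void sum_blocked(const float* in, float* out, int per_thread)
{
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < per_thread; ++k) {
        acc += in[blocked_index(t, k, per_thread)];
    }
    out[t] = acc;
}

// At step k, neighbouring threads read neighbouring addresses.
__global__ void sum_interleaved(const float* in, float* out, int per_thread)
{
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < per_thread; ++k) {
        acc += in[interleaved_index(t, k, blockDim.x)];
    }
    out[t] = acc;
}
```

Launched as `<<<1, 4>>>` with `per_thread = 3` on the 12-element array, the first step (`k = 0`) of `sum_blocked` reads elements 0, 3, 6, 9, while the first step of `sum_interleaved` reads 0, 1, 2, 3, matching the two options discussed above.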

The best approach is probably to write your kernel and then profile it to see whether you have non-coalesced global loads and stores.
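Short of a full profiler run, a rough first check is to time the same kernel with different access strides using CUDA events. The sketch below is an assumption-laden illustration rather than a replacement for profiling: `strided_index`, `copy_strided`, and `time_copy_ms` are hypothetical names, `d_in` and `d_out` are assumed to be device pointers to at least `n` floats, `n` is assumed positive, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Element read by logical thread i. With stride == 1, consecutive threads
// read consecutive elements; with stride > 1, neighbours are `stride` apart.
// The modulo keeps the index inside the array.
__host__ __device__ inline int strided_index(int i, int stride, int n)
{
    return (int)(((long long)i * stride) % n);
}

__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[strided_index(i, stride, n)];
    }
}

// Returns the elapsed time of one kernel launch in milliseconds.
float time_copy_ms(const float* d_in, float* d_out, int n, int stride)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Comparing `time_copy_ms(d_in, d_out, n, 1)` with a larger stride gives a hint about how much the access pattern matters on a given device. A single launch is noisy, so repeating the measurement and discarding the first run is advisable.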
