CUDA __constant__ dereference to global memory. Which cache?


Question



Instead of passing lots of arguments to a kernel, I use a __constant__ variable. This variable is an array of structures, each containing many pointers to data in global memory (these pointers would otherwise be a list of arguments); one array entry per dataset the kernel is called on. The kernel then accesses this array and dereferences the pointers to reach the appropriate global data. My question is: does this data get cached through L2 or through the constant cache? Moreover, if the latter, and if loaded via __ldg(), does it go through L1 or still the constant cache?

To be more specific, the data itself sits in global memory, but the kernel dereferences a __constant__ variable to get to it. Does this adversely affect caching?
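A minimal sketch of the setup described above, with hypothetical names (KernelParams, process, etc. are illustrative, not from the source): an array of parameter structs lives in constant memory, and each struct member points into global memory.

```cuda
// Parameter block: the struct array itself is in __constant__ memory,
// but its pointer members refer to ordinary global (device) memory.
struct KernelParams {
    float *input;   // device pointer into global memory
    float *output;  // device pointer into global memory
    int    n;       // element count for this dataset
};

__constant__ KernelParams params[8];   // one entry per dataset

__global__ void process(int dataset)
{
    // Reading params[dataset] goes through the constant path;
    // dereferencing .input / .output issues normal global loads/stores.
    const KernelParams p = params[dataset];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        p.output[i] = p.input[i] * 2.0f;
}
```

On the host side the array would be filled with cudaMemcpyToSymbol(params, hostParams, sizeof(hostParams)) before launching the kernel.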

Solution

Constant variables are accessed either via immediate constants (constants encoded in the opcode) or via indexed constants (accessed through the LDC instruction); they are addressed by a (bank, offset) pair, not by address. These reads go through the immediate-constant and indexed-constant caches. On some chips these are the same cache. Examples of constant accesses are:

// immediate constant
ADD r0, r1, c[bank][offset]

// r1 has packed version of bank, offset
LDC r0, r1

Kernel arguments for cc2.0 and above are passed in constant memory, so you will see immediate constant accesses for them.
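To see this, any trivial kernel will do (the kernel below is an illustrative example, not from the source); dumping its SASS with cuobjdump -sass or nvdisasm shows the arguments read as c[bank][offset] operands rather than loaded by address.

```cuda
// On cc2.0+ the driver places a, x, y and n in a constant bank, so in
// the SASS they appear as immediate-constant operands, e.g. c[0x0][...]
// (the exact bank and offsets vary by architecture and ABI).
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // 'a' is read straight from c[][]
}
```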

Constant accesses go through the constant memory hierarchy, which in the end resolves to a global address that can be in system memory or device memory.

If you set a constant variable to a pointer to global memory, then the data it points to will be read through the data hierarchy.

If you define a const variable, the compiler can choose to place the read-only data either at a bank/offset or at an address.

If you review the SASS (nvdisasm or the tools), you will see LD instructions. Depending on the chip, this data may be cached in the L1/Tex cache and then the L2 cache.

SHARED
LDS/STS/ATOMS             -> shared memory

GENERIC
LD/ST (generic to shared) -> shared memory
LD/ST (generic to global) -> L1/TEX -> L2
LD/ST (generic to local)  -> L1/TEX -> L2

LOCAL
LDL/STL (local)           -> L1/TEX -> L2

GLOBAL
LDG/STG (global)          -> TEX    -> L2

INDEXED CONSTANT
LDC -> indexed constant cache -> ... -> L2

L2 misses can go to device memory or pinned system memory.

In the case you mention, the constant variable will very likely be accessed via an immediate constant (the best-performing path, assuming a reasonable total size of constants), and the dereferenced pointer will result in a global memory access.
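A sketch of that split, with illustrative names (g_table and lookup are hypothetical): the pointer itself is fetched through the constant path, while the element load is a global access that can be routed through the read-only/texture path with __ldg() on sm_35 and later.

```cuda
// The pointer value is stored in constant memory; reading it goes
// through the constant path described above.
__constant__ const float *g_table;

__global__ void lookup(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // The element load is issued as LDG; __ldg() (sm_35+) hints it
        // through the read-only/texture cache rather than the constant cache.
        out[i] = __ldg(&g_table[i]);
}
```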

On GK110 LDG instructions are cached in the texture cache.

On Maxwell LDG.CI instructions are cached in the texture cache. LDG.CA operations are cached in the texture cache (GM20x). All other LDG accesses go through the texture cache but are not cached beyond the lifetime of the warp instruction.

