Why is the constant memory size limited in CUDA?


Question

According to the "CUDA C Programming Guide", a constant memory access benefits only if the multiprocessor constant cache is hit (Section 5.3.2.4)¹. Otherwise there can be even more memory requests for a half-warp than in the case of a coalesced global memory read. So why is the constant memory size limited to 64 KB?

One more question, in order not to ask twice: as far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does texture usage still make sense, or are global memory reads cached in the same manner?

¹ Constant memory (Section 5.3.2.4):


The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.


For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.


A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.


The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
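The splitting rule quoted above can be illustrated with two hypothetical kernels (the array name and size are illustrative, not from the original text): when all threads of a half-warp read the same constant address the access is a single request, but per-thread addresses serialize it.

```cuda
__constant__ float coeffs[256];

// Every thread reads the same address: one request, serviced at
// constant-cache throughput on a hit.
__global__ void uniform_read(float *out, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = coeffs[k] * out[i];          // same address for all threads
}

// Each thread reads a different address: the request is split into as
// many separate requests as there are distinct addresses, decreasing
// throughput by that factor.
__global__ void divergent_read(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = coeffs[i % 256] * out[i];    // per-thread addresses
}
```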

Answer


The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8KB (see the CUDA Programming Guide v4.2 Table F-2).


Constant memory is used by the driver, compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see disassembly).


Variables placed in constant memory can be read and written using the host runtime functions cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2 section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
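A minimal host-side sketch of that runtime API (the variable name and sizes here are illustrative assumptions, not from the original answer):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// A small table placed in the 64 KB constant memory space.
__constant__ float table[64];

int main() {
    float host_data[64];
    for (int i = 0; i < 64; ++i) host_data[i] = 0.5f * i;

    // Copy host data into the __constant__ variable...
    cudaMemcpyToSymbol(table, host_data, sizeof(host_data));

    // ...and read it back through the symbol to verify.
    float check[64];
    cudaMemcpyFromSymbol(check, table, sizeof(check));
    printf("table[10] = %f\n", check[10]);
    return 0;
}
```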

On Fermi, the texture cache, constant cache, L1 cache, and I-Cache are all level 1 caches in or around each SM. All level 1 caches access device memory through the L2 cache.

The 64 KB constant limit is per CUmodule, which is a CUDA compilation unit. The concept of a CUmodule is hidden by the CUDA runtime but accessible through the CUDA Driver API.
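The per-CUmodule nature can be observed through the Driver API, which lets you query a `__constant__` symbol inside a specific loaded module (a hedged sketch; the module file name `kernels.cubin` and symbol name `table` are assumptions for illustration):

```cuda
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // Each loaded module carries its own constant-memory allocation,
    // which is why the 64 KB limit applies per CUmodule.
    CUmodule mod;
    cuModuleLoad(&mod, "kernels.cubin");            // hypothetical module file

    CUdeviceptr ptr; size_t bytes;
    cuModuleGetGlobal(&ptr, &bytes, mod, "table");  // hypothetical symbol
    printf("constant symbol size: %zu bytes\n", bytes);

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```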

