Why is the constant memory size limited in CUDA?


Question

According to the "CUDA C Programming Guide", a constant memory access is beneficial only if the multiprocessor's constant cache is hit (Section 5.3.2.4)¹. Otherwise there can be even more memory requests for a half-warp than in the case of a coalesced global memory read. So why is the constant memory size limited to 64 KB?

One more question, so as not to ask twice. As far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does texture usage still make sense, or are global memory reads cached in the same manner?


¹ The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.

For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
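The splitting behavior above can be illustrated with a short sketch (not from the original answer; the array size and kernel names are illustrative), contrasting a broadcast read, where every thread in a warp reads the same constant address, with a divergent read that serializes into one request per distinct address:

```cuda
__constant__ float coeffs[64];   // lives in the 64 KB constant space

// Fast path: all threads read coeffs[k], a single address per warp,
// so the constant cache can broadcast the value in one transaction.
__global__ void broadcastRead(const float *in, float *out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * coeffs[k];
}

// Slow path: each thread in a warp reads a different constant address,
// so the request is split into as many separate requests as there are
// distinct addresses (up to 32-way serialization per warp).
__global__ void divergentRead(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * coeffs[threadIdx.x % 64];
}
```

This is why constant memory suits coefficients and parameters that all threads read uniformly, while per-thread indexed data is usually better served from global or shared memory.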


Answer

The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8 KB (see the CUDA Programming Guide v4.2, Table F-2).

Constant memory is used by the driver, the compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see the disassembly).

Variables placed in constant memory can be read and written using the host runtime functions cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2, Section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
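A minimal sketch of the symbol copies mentioned above (the symbol name `table` and the sizes are illustrative, not from the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float table[256];   // 1 KB of the 64 KB constant space

int main()
{
    float host[256];
    for (int i = 0; i < 256; ++i) host[i] = 0.5f * i;

    // Host -> constant memory (write side).
    cudaMemcpyToSymbol(table, host, sizeof(host));

    // Constant memory -> host (read side), e.g. to verify the copy.
    float check[256];
    cudaMemcpyFromSymbol(check, table, sizeof(check));

    printf("table[10] = %f\n", check[10]);
    return 0;
}
```

Note that the symbol itself, not a string name, is passed to both calls in the modern runtime API.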

On Fermi, the texture cache, constant cache, L1 cache, and I-cache are all level-1 caches in or around each SM. All level-1 caches access device memory through the L2 cache.

The 64 KB constant limit is per CUmodule, which is a CUDA compilation unit. The concept of a CUmodule is hidden under the CUDA runtime but accessible through the CUDA Driver API.
