CUDA: Is texture memory still useful to speed up access times for compute capability 2.x and newer?


Problem Description

I'm writing an image processing app where I have to fetch pixel data in an uncoalesced manner.

Initially I implemented my algorithm using global memory. Later I reimplemented it using texture memory. To my amazement, it became slower! I thought maybe something was wrong with the cudaMalloc/tex1Dfetch style, so I changed it to cudaArray/tex2D. Nothing changed.
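For reference, a minimal sketch of the cudaArray/tex2D path described above, written against the legacy texture reference API that was current for compute capability 2.x devices (it has since been deprecated and removed in favor of texture objects). Names such as hostData, devOut, width and height are placeholders, not code from the original question.

    #include <cuda_runtime.h>

    // Legacy texture reference, bound at file scope (CC 2.x era API).
    texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

    __global__ void copyFromTexture(float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            // Unnormalized coordinates; +0.5f samples the texel center.
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    void runExample(const float* hostData, float* devOut, int width, int height)
    {
        // cudaArray: opaque, texture-friendly storage for 2D data.
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &desc, width, height);
        cudaMemcpy2DToArray(arr, 0, 0, hostData, width * sizeof(float),
                            width * sizeof(float), height, cudaMemcpyHostToDevice);

        texRef.addressMode[0] = cudaAddressModeClamp;
        texRef.addressMode[1] = cudaAddressModeClamp;
        texRef.filterMode     = cudaFilterModePoint;   // no interpolation
        texRef.normalized     = false;                 // integer-style coordinates
        cudaBindTextureToArray(texRef, arr, desc);

        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        copyFromTexture<<<grid, block>>>(devOut, width, height);

        cudaUnbindTexture(texRef);
        cudaFreeArray(arr);
    }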

Then I stumbled upon Shane Cook's "CUDA Programming", where he wrote:

As compute 1.x hardware has no cache to speak of, the 6–8K of texture memory per SM provides the only method to truly cache data on such devices. However, with the advent of Fermi and its up to 48 K L1 cache and up to 768 K shared L2 cache, this made the usage of texture memory for its cache properties largely obsolete. The texture cache is still present on Fermi to ensure backward compatibility with previous generations of code.

I have GeForce GT 620M (Fermi, compute cap. 2.1).

So I need some advice from professionals! Should I dig deeper into texture memory and its texture cache to try to optimize performance? Or should I rather stick with global memory and the L1/L2 caches?

Solution

Textures can indeed be useful on devices of compute capability >= 2.0.

Textures and cudaArrays can use memory laid out along a space-filling curve, which can allow for a better cache hit rate due to better 2D spatial locality.

The texture cache is separate from the other caches. So it has its own dedicated memory and bandwidth and reading from it does not interfere with the other caches. This can become important if there is a lot of pressure on your L1/L2 caches.

Textures also provide built-in functionality such as interpolation, various addressing modes (clamp, wrap, mirror), and normalized addressing with floating-point coordinates. These can be used without any extra cost and can greatly improve performance in kernels where such functionality is needed.
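For illustration, a 2D float texture reference like the texRef sketched in the question could be reconfigured to use these hardware features just by changing its attributes before binding. This is a sketch of one possible configuration, not a recommendation:

    // Hypothetical reconfiguration of a 2D float texture reference named texRef.
    texRef.addressMode[0] = cudaAddressModeWrap;   // x coordinate wraps around the image
    texRef.addressMode[1] = cudaAddressModeWrap;   // y coordinate wraps around the image
    texRef.filterMode     = cudaFilterModeLinear;  // bilinear interpolation done by the texture unit
    texRef.normalized     = true;                  // coordinates in [0, 1) instead of texel indices

    // Inside the kernel, fractional normalized coordinates now return interpolated values:
    // float v = tex2D(texRef, (x + 0.5f) / width, (y + 0.5f) / height);

Note that wrap and mirror addressing require normalized coordinates, and linear filtering requires a floating-point return type, which is why all three attributes are changed together here.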

On early CUDA architectures, textures and cudaArrays could not be written by a kernel. On architectures of compute capability >= 2.0, they can be written via CUDA surfaces.
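A minimal sketch of writing into a cudaArray through a surface on compute capability >= 2.0, using the surface reference API of that era (the kernel name and dimensions are placeholders; the key requirement is allocating the array with the cudaArraySurfaceLoadStore flag):

    #include <cuda_runtime.h>

    // Surface reference bound at file scope (CC 2.x era API).
    surface<void, cudaSurfaceType2D> surfRef;

    __global__ void fillSurface(int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            float value = (float)(y * width + x);
            // Note: the x coordinate of surf2Dwrite is given in bytes.
            surf2Dwrite(value, surfRef, x * sizeof(float), y);
        }
    }

    void runSurfaceExample(int width, int height)
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray_t arr;
        // The array must be allocated with surface load/store enabled.
        cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);
        cudaBindSurfaceToArray(surfRef, arr);

        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        fillSurface<<<grid, block>>>(width, height);

        cudaFreeArray(arr);
    }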

Determining if you should use textures or a regular buffer in global memory comes down to the intended usage and access patterns for the memory. It will be project specific.

You are using the Fermi architecture, with a device that has been rebranded into the 6xx series.

For those on the Kepler architecture, take a look at NVIDIA's Inside Kepler Presentation. In particular, the slides "Texture Performance", "Texture Cache Unlocked" and "const __restrict Example".
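As a rough illustration of the const __restrict idea from those slides (a sketch, not the presentation's exact code): on Kepler GK110 (compute capability 3.5), qualifying a kernel pointer as both const and __restrict__ allows the compiler to route its loads through the read-only data cache, which shares hardware with the texture cache, without any explicit texture setup:

    // Sketch: reads of x may be serviced by the read-only data cache
    // because x is qualified as both const and __restrict__.
    __global__ void saxpy_readonly(int n, float a,
                                   const float* __restrict__ x,
                                   float* __restrict__ y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }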
