Yet Another CUDA Texture Memory Thread. (Why should texture memory be faster on Fermi?)


Question



There are quite a few Stack Overflow threads asking why a kernel using textures is not faster than one using global memory access. The answers and comments always seem a little esoteric to me.

The NVIDIA white paper on the Fermi architecture states in black and white:

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).

So why on earth should one expect any speedup from using texture memory on Fermi devices, since every memory fetch (regardless of whether it is bound to a texture or not) goes through the same L2 cache? In fact, in most cases direct access to global memory should be faster, since it is also cached in L1, which a texture fetch is not. This is also reported in a few related questions here on Stack Overflow.

Can someone confirm this or show me what I'm missing?

Answer

You are neglecting that each Streaming Multiprocessor has a texture cache (see the picture below illustrating a Streaming Multiprocessor for Fermi).

The texture cache serves a different purpose than the L1/L2 caches: it is optimized for data locality. Here, data locality covers every case where a kernel must access data belonging to semantically (not physically) neighboring points of a regular Cartesian 1D, 2D or 3D grid. To better explain this concept, consider the following figure, which illustrates the stencil involved in 2D or 3D finite difference calculations.

Calculating finite differences at the red point involves accessing the data associated with the blue points. These data are not all physical neighbors of the red point, since they are not stored consecutively in global memory once the 2D or 3D array is flattened to 1D. However, they are semantic neighbors of the red point, and texture memory is very good at caching such values. The L1/L2 caches, on the other hand, work well when the same datum or its physical neighbors are accessed frequently.
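To make the "semantic versus physical neighbor" point concrete, here is a minimal host-side C++ sketch (no GPU required). It assumes a row-major layout, and the grid width of 1024 and the center point (10, 20) are hypothetical values chosen only for illustration; `flat_index` and `index_distance` are helper names made up for this example, not part of any CUDA API.

```cpp
#include <cstddef>

// Row-major flattening of a 2D grid (assumed layout): element (row, col)
// of a grid with `width` columns is stored at row * width + col.
constexpr std::size_t flat_index(std::size_t row, std::size_t col,
                                 std::size_t width) {
    return row * width + col;
}

// Absolute distance, in elements, between two flattened positions.
constexpr std::size_t index_distance(std::size_t a, std::size_t b) {
    return a > b ? a - b : b - a;
}

// Hypothetical example: a 1024-column grid and a "red" point at (10, 20).
constexpr std::size_t kWidth  = 1024;
constexpr std::size_t kCenter = flat_index(10, 20, kWidth);

// East/west stencil neighbors sit right next to the center in memory...
static_assert(index_distance(kCenter, flat_index(10, 21, kWidth)) == 1, "east");
static_assert(index_distance(kCenter, flat_index(10, 19, kWidth)) == 1, "west");

// ...while north/south neighbors are a full row (kWidth elements) away.
static_assert(index_distance(kCenter, flat_index(9, 20, kWidth)) == kWidth, "north");
static_assert(index_distance(kCenter, flat_index(11, 20, kWidth)) == kWidth, "south");
```

With 4-byte floats, the north and south neighbors in this example lie 4096 bytes from the center, far more than a single 128-byte cache line, even though they are only one grid step away. That is the kind of access pattern the answer describes as semantically but not physically local.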

The other side of the coin is that the texture cache has a higher latency than the L1/L2 caches, so in some cases not using textures may not degrade performance significantly, thanks to the L1/L2 caching mechanism. From this point of view, textures were most important in the early CUDA architectures, where global memory reads were not cached. But, as demonstrated in "Is 1D texture memory access faster than 1D global memory access?", texture memory is still worth using on Fermi.
