When should texture memory be preferred over constant memory?


Question

Does storing data in constant memory provide any benefit over texture memory on the Pascal architecture if the data request frequency is very high among threads (every thread picks at least one value from a specific column)?

EDIT: This is a split version of this question to improve community searching

Solution

If your data meets the expectations for constant memory usage, using constant memory is a good idea in the general case. It lets your code take advantage of an additional cache mechanism provided by the GPU hardware, which in turn puts less pressure on the texture cache used by other parts of your code.

Because constant memory and its cache, like texture and surface memory and their cache, are defined by the hardware's Compute Capability, the target hardware has to be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern, on how the caches are used, and on which caches are available.
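As a minimal sketch of how the target hardware could be inspected, the helper below queries the CUDA runtime for the compute capability, the constant memory size, and the 2D texture limits. The function name `printMemoryLimits` is hypothetical; the `cudaDeviceProp` fields are the ones exposed by the runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the limits relevant to the constant/texture choice.
// Returns false if the device properties cannot be queried.
bool printMemoryLimits(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("constant memory: %zu bytes\n", prop.totalConstMem);
    std::printf("max 2D texture:  %d x %d texels\n",
                prop.maxTexture2D[0], prop.maxTexture2D[1]);
    return true;
}
```

Calling `printMemoryLimits(0)` on the first device shows whether a data set can fit in constant memory at all before any access-pattern considerations come into play.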

Constant memory performance depends on broadcasting data among the threads of a warp, so maximum performance is achieved when all threads request the very same address and the data is already in the cache. If threads in the same warp request multiple addresses, the request is split into multiple requests, because only a single address can be retrieved per operation. If the number of split requests caused by reading from multiple addresses is too high, texture and surface memory may outperform constant memory in that specific situation. This is detailed in the CUDA Programming Guide:

The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
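The split-request behaviour quoted above can be illustrated with two kernels that read the same constant table in different ways. This is only a sketch: the table `kTable`, its size, and the function names are assumptions made for illustration, and the code checks correctness only (it does not measure the throughput difference).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical lookup table in constant memory (name and size are illustrative).
__constant__ float kTable[256];

// All threads of a warp read the same address: a single broadcast request.
__global__ void uniformRead(float* out, int idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = kTable[idx];
}

// Each thread of a warp reads a different address: the request is split into
// as many requests as there are distinct addresses (up to 32 per warp).
__global__ void scatteredRead(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = kTable[i % 256];
}

// Runs both kernels and returns true if their outputs match the table.
bool runConstantDemo() {
    const int n = 1024;
    std::vector<float> table(256), a(n, -1.0f), b(n, -1.0f);
    for (int k = 0; k < 256; ++k) table[k] = 0.5f * k;
    if (cudaMemcpyToSymbol(kTable, table.data(), 256 * sizeof(float)) != cudaSuccess)
        return false;

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;

    uniformRead<<<n / 256, 256>>>(d, 7, n);
    bool ok = cudaMemcpy(a.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    scatteredRead<<<n / 256, 256>>>(d, n);
    ok = ok && cudaMemcpy(b.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    for (int i = 0; ok && i < n; ++i)
        ok = (a[i] == table[7]) && (b[i] == table[i % 256]);
    return ok;
}
```

Both kernels produce correct results; the difference is that `scatteredRead` makes each warp touch 32 distinct constant addresses, which is the access pattern where a texture fetch becomes worth considering.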

The texture cache is more flexible than the constant cache. It can take advantage of reads within the same warp from addresses that are close together in 2D. Despite these advantages over constant memory, texture memory should in general be used when the data access pattern or the data size does not meet the constant memory requirements, or when you specifically want to make use of the texture cache. More detailed information can be found in the CUDA Programming Guide:

The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.

Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:

  • If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads;
  • Addressing calculations are performed outside the kernel by dedicated units;
  • Packed data may be broadcast to separate variables in a single operation;
  • 8-bit and 16-bit integer input data may be optionally converted to 32 bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
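The 2D locality and the integer-to-float conversion listed above can be sketched with a texture object bound to a CUDA array. The kernel and driver names (`fetch2D`, `runTextureDemo`) and the 64x64 test image are assumptions for illustration; the runtime calls are the standard texture object API.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches the texel at its own (x, y), so the threads of a
// 16x16 block read texels that are close together in 2D.
__global__ void fetch2D(cudaTextureObject_t tex, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) {
        // The stored 8-bit value comes back as a float in [0.0, 1.0].
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

// Hypothetical driver: returns true if every fetched value equals texel / 255.
bool runTextureDemo() {
    const int w = 64, h = 64;
    std::vector<unsigned char> img(w * h);
    for (int i = 0; i < w * h; ++i) img[i] = static_cast<unsigned char>(i % 256);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t arr = nullptr;
    if (cudaMallocArray(&arr, &desc, w, h) != cudaSuccess) return false;
    // Source pitch and row width are given in bytes (1 byte per texel here).
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), w, w, h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;        // no interpolation
    td.readMode = cudaReadModeNormalizedFloat;  // 8-bit integer -> float
    td.normalizedCoords = 0;                    // address texels by index

    cudaTextureObject_t tex = 0;
    float* dOut = nullptr;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess &&
              cudaMalloc(&dOut, w * h * sizeof(float)) == cudaSuccess;
    std::vector<float> out(w * h, -1.0f);
    if (ok) {
        fetch2D<<<dim3(w / 16, h / 16), dim3(16, 16)>>>(tex, dOut, w, h);
        ok = cudaMemcpy(out.data(), dOut, w * h * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < w * h; ++i)
        ok = std::fabs(out[i] - img[i] / 255.0f) < 1e-6f;

    cudaFree(dOut);
    if (tex) cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    return ok;
}
```

The 2D thread block is a deliberate choice here: it makes the threads of a warp cover a small 2D patch of texels, which is the pattern the texture cache is optimized for.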

Keep in mind that combining texture memory with constant memory can be a real advantage over relying on just one of them, because it lets the code use the dedicated cache of each, and both caches perform better than retrieving data from outside the cache (i.e. from device memory).
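One way such a combination might look is a small 3x3 filter: the weights live in constant memory, where every thread reads the same address at the same time, and the image is read through a texture, where neighbouring threads read neighbouring texels. This is a sketch under assumptions, not a tuned implementation; `kWeights`, `conv3x3`, `runCombinedDemo`, and the test image are all hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 3x3 filter weights, read uniformly by every thread.
__constant__ float kWeights[9];

__global__ void conv3x3(cudaTextureObject_t img, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            // Weight: same address for the whole warp (constant-cache broadcast).
            // Texel: per-thread address, close to its neighbours in 2D.
            acc += kWeights[(dy + 1) * 3 + (dx + 1)] *
                   tex2D<float>(img, x + dx + 0.5f, y + dy + 0.5f);
        }
    }
    out[y * w + x] = acc;
}

// Hypothetical driver: box-sums an image whose value is x + y and returns
// true if every interior pixel equals 9 * (x + y).
bool runCombinedDemo() {
    const int w = 32, h = 32;
    std::vector<float> img(w * h), out(w * h, -1.0f), weights(9, 1.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) img[y * w + x] = float(x + y);

    if (cudaMemcpyToSymbol(kWeights, weights.data(), 9 * sizeof(float)) != cudaSuccess)
        return false;

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    if (cudaMallocArray(&arr, &desc, w, h) != cudaSuccess) return false;
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;  // return the stored float unchanged

    cudaTextureObject_t tex = 0;
    float* dOut = nullptr;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess &&
              cudaMalloc(&dOut, w * h * sizeof(float)) == cudaSuccess;
    if (ok) {
        conv3x3<<<dim3(w / 16, h / 16), dim3(16, 16)>>>(tex, dOut, w, h);
        ok = cudaMemcpy(out.data(), dOut, w * h * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int y = 1; ok && y < h - 1; ++y)
        for (int x = 1; ok && x < w - 1; ++x)
            ok = out[y * w + x] == 9.0f * float(x + y);

    cudaFree(dOut);
    if (tex) cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    return ok;
}
```

Whether this split actually beats a single-memory version depends on the device and the data, so it is worth profiling both variants on the target hardware.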
