How many memory latency cycles per memory access type in OpenCL/CUDA?


Question



I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types, like the texture cache, constant cache, and shared memory. Registers have 0 memory latency.

I think the constant cache is the same as registers if all threads use the same address in the constant cache. I am not so sure about the worst case.

Is shared memory the same as registers as long as there are no bank conflicts? If there are, how does the latency unfold?

What about texture cache?

Solution

The latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as a throughput architecture, which means that by creating enough threads the latency to the memories, including global memory, is hidden.
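As a rough sketch of what latency hiding looks like in practice (the kernel and sizes here are illustrative, not from the original answer): launch far more threads than the GPU has cores, so the scheduler always has warps ready to run while other warps are waiting on memory.

#include <cuda_runtime.h>

// Illustrative SAXPY kernel: one element per thread. Each global load
// costs roughly 400-600 cycles, but with thousands of warps in flight
// the scheduler covers that latency with other warps' work.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;        // enough work to oversubscribe the GPU
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    cudaMemset(d_y, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // thousands of blocks
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}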

The reason the guides talk about the latency to global memory is that it is orders of magnitude higher than the latency of the other memories, meaning that it is the dominant latency to consider for optimization.

You mentioned the constant cache in particular. You are quite correct that if all threads within a warp (i.e. a group of 32 threads) access the same address then there is no penalty: the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must be serialized, since the cache can only provide one value at a time. If you are using the CUDA Profiler, this will show up under the serialization counter.
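To make the two cases concrete, here is a minimal sketch (the coeffs table and the kernel names are hypothetical):

#include <cuda_runtime.h>

// Hypothetical 256-entry lookup table in constant memory.
__constant__ float coeffs[256];

// All 32 threads of a warp read the SAME constant address: the constant
// cache broadcasts the value to the whole warp in one transaction.
__global__ void uniformRead(float *out) {
    out[threadIdx.x] = coeffs[0];
}

// Each thread reads a DIFFERENT constant address: the cache can serve
// only one address at a time, so the warp's 32 reads are serialized.
__global__ void divergentRead(float *out) {
    out[threadIdx.x] = coeffs[threadIdx.x];
}

int main() {
    float host[256];
    for (int i = 0; i < 256; ++i) host[i] = (float)i;
    cudaMemcpyToSymbol(coeffs, host, sizeof(host));

    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    uniformRead<<<1, 256>>>(d_out);   // no serialization penalty
    divergentRead<<<1, 256>>>(d_out); // shows up under the profiler's
                                      // serialization counter
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}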

Shared memory, unlike the constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.
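As an illustration of what a bank conflict looks like in code, a minimal sketch, assuming a device with 32 banks of 4-byte words (Fermi and later; compute 1.x devices have 16 banks). The kernel names are hypothetical:

#include <cuda_runtime.h>

// Conflict-free: consecutive threads touch consecutive 32-bit words,
// which fall in different banks, so the warp is serviced in one pass.
__global__ void conflictFree(float *out) {
    __shared__ float buf[32 * 32];
    int tid = threadIdx.x;
    buf[tid] = (float)tid;          // stride 1: one thread per bank
    __syncthreads();
    out[tid] = buf[tid];
}

// 32-way conflict: with a stride of 32 words, all 32 threads of the
// warp hit bank 0, so the access is replayed 32 times.
__global__ void worstCase(float *out) {
    __shared__ float buf[32 * 32];
    int tid = threadIdx.x;
    buf[tid * 32] = (float)tid;     // stride 32: every thread in bank 0
    __syncthreads();
    out[tid] = buf[tid * 32];
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    conflictFree<<<1, 32>>>(d_out); // one warp, full-speed shared access
    worstCase<<<1, 32>>>(d_out);    // one warp, 32-way bank conflict
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}

A common remedy for strided patterns like the second kernel is to pad the array (e.g. declare it as buf[32][33]) so that consecutive rows begin in different banks.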
