CUDA: When to use shared memory and when to rely on L1 caching?

Problem description

After Compute Capability 2.0 (Fermi) was released, I've wondered whether there are any use cases left for shared memory. That is, when is it better to use shared memory than to just let L1 perform its magic in the background?

Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modification?

To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1) and synchronize with __threadfence_block()? The latter option should be easier to implement, since it doesn't have to keep the values in two different locations, and it should be faster, because there is no explicit copy from global to shared memory. Since the data gets cached in L1, threads don't have to wait for it to actually make it all the way out to global memory.
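
To make the comparison concrete, here is a minimal sketch of the two patterns (the kernel names and the scratch buffer are hypothetical, and the grid is assumed to exactly cover the input). Note that, as far as I know, __threadfence_block() only orders memory operations and is not itself a barrier, so the global-memory variant still needs __syncthreads() before the cross-thread reads:

    // Variant 1: stage per-block data in shared memory, synchronize with __syncthreads().
    __global__ void reverse_shared(const float* in, float* out)
    {
        __shared__ float tile[256];                  // explicitly managed scratchpad (blockDim.x <= 256)
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];                   // write this thread's element to shared memory
        __syncthreads();                             // barrier: all writes are now visible to the block

        out[i] = tile[blockDim.x - 1 - threadIdx.x]; // read a value written by another thread
    }

    // Variant 2: exchange through a global scratch buffer, ordered with
    // __threadfence_block(); a barrier is still required before reading.
    __global__ void reverse_global(const float* in, float* out, float* scratch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        scratch[i] = in[i];                          // write to global memory
        __threadfence_block();                       // order this write for other threads in the block
        __syncthreads();                             // barrier: make sure the peers have written too

        out[i] = scratch[blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x)];
    }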

With shared memory, one is guaranteed that a value placed there remains there for the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let L1 manage it based on the usage pattern the algorithm actually has?

Recommended answer

As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU. So your comment that "this is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.

Data in the L1 cache isn't evicted because it isn't used often enough. Usually it is evicted when a request is made for a memory region that wasn't previously in the cache and whose address resolves to a location that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory entry can only be cached in a small subset of the entire cache, based on its address.
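
As a rough illustration of that mapping, the toy model below computes which set two addresses fall into; the line size, associativity, and cache size are assumed values for the sake of the example, not NVIDIA's documented cache geometry:

    // Toy model of n-way set-associative address mapping.
    // All parameters are illustrative assumptions, not NVIDIA's actual cache design.
    #include <cstdint>
    #include <cstdio>

    const uint64_t LINE_SIZE  = 128;                        // bytes per cache line (assumed)
    const uint64_t NUM_WAYS   = 6;                          // associativity (assumed)
    const uint64_t CACHE_SIZE = 48 * 1024;                  // total cache size in bytes (assumed)
    const uint64_t NUM_SETS   = CACHE_SIZE / (LINE_SIZE * NUM_WAYS);

    int main()
    {
        // Two addresses whose line index differs by a multiple of NUM_SETS map to the
        // same set; once all NUM_WAYS slots of that set are occupied, caching another
        // such line forces an eviction, regardless of how often the resident lines are used.
        uint64_t a = 0x10000;
        uint64_t b = a + NUM_SETS * LINE_SIZE;
        printf("set(a) = %llu, set(b) = %llu\n",
               (unsigned long long)((a / LINE_SIZE) % NUM_SETS),
               (unsigned long long)((b / LINE_SIZE) % NUM_SETS));
        return 0;
    }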

I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with the cache everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes find a better way, since you're the one who knows what input will be given and what threads will do what (to a certain extent, of course).
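
As one concrete example of exploiting that knowledge, a block-wide reduction keeps its partial sums in shared memory for the whole block, so no cache eviction policy can push them out mid-computation. This is only a hypothetical sketch, assuming blockDim.x is a power of two and at most 256:

    // Hypothetical sketch: block-wide sum reduction using shared memory as an
    // explicitly managed scratchpad. Assumes blockDim.x is a power of two <= 256.
    __global__ void block_sum(const float* in, float* out, int n)
    {
        __shared__ float partial[256];               // one slot per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction entirely within the block's scratchpad.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = partial[0];            // one partial result per block
    }

Each block writes one partial result; out would then be reduced again in a second kernel launch or on the host.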
