When to use volatile with shared CUDA Memory


Question

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:

__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}

Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array, and so I need it to be volatile, or will it only cache elements?

Thanks!

Answer

If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by locating them in registers (whose scope is specific to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile. However, this sort of communication pattern often also requires execution barriers to enforce the ordering of reads and writes, so continue reading about barriers below.
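
For illustration, here is a minimal sketch (not from the original question; the kernel name, the single-warp block size, and the neighbor exchange are all invented) of the kind of inter-thread communication through shared memory where volatile matters:

// Classic warp-synchronous pattern: threads exchange values through
// shared memory with no __syncthreads() between the write and the read,
// so the array is declared volatile to force the compiler to issue real
// shared-memory loads and stores instead of reusing register copies.
__global__ void warpExchange(float *out)
{
    volatile __shared__ float buf[32];   // volatile: no register caching

    int lane = threadIdx.x;              // assumes blockDim.x == 32
    buf[lane] = (float)lane;

    // Without volatile, this read could be satisfied from a stale
    // register copy rather than the value the neighboring thread wrote.
    out[lane] = buf[(lane + 1) % 32];

    // Note: on Volta and newer, independent thread scheduling means this
    // warp-level pattern additionally needs __syncwarp().
}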

Obviously, if each thread only accesses its own elements of shared memory, and never those associated with another thread, then this does not matter, and the compiler optimization will not break anything.

In your case, where you have a section of code in which each thread is accessing its own elements of shared memory, and the only inter-thread access occurs at a well-understood location, you could use a memory fence function to force the compiler to evict any values that are temporarily stored in registers back out to the shared array. So you might think that __threadfence_block() would be useful, but in your case, __syncthreads() already has memory-fencing functionality built in. Your __syncthreads() call is therefore sufficient both to force thread synchronization and to force any register-cached values of shared memory to be evicted back to shared memory.
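
Putting that together, your pattern is safe without volatile as long as the __syncthreads() sits between the per-thread writes and thread 0's reads. A complete sketch (the kernel signature, the input arrays a and b, and the single-block launch are invented for illustration):

#define THREADS_PER_ACTION 256

__global__ void sumProducts(const float *a, const float *b, float *result)
{
    __shared__ float products[THREADS_PER_ACTION];   // no volatile needed

    // each thread writes only its own slot
    products[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // barrier plus memory fence: every write above is visible below
    __syncthreads();

    // thread 0 reads every slot; the fence inside __syncthreads()
    // guarantees it sees the stored values, not stale register copies
    if (threadIdx.x == 0) {
        float globalSum = 0.0f;
        for (int i = 0; i < THREADS_PER_ACTION; i++)
            globalSum += products[i];
        *result = globalSum;
    }
}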

By the way, if that reduction at the end of your code is a performance concern, you could consider using a parallel reduction method to speed it up.
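
As a sketch of what that could look like (same invented signature as above; assumes THREADS_PER_ACTION equals blockDim.x and is a power of two), a standard shared-memory tree reduction finishes in log2(THREADS_PER_ACTION) passes instead of a serial loop over every element:

__global__ void sumProductsFast(const float *a, const float *b, float *result)
{
    __shared__ float products[THREADS_PER_ACTION];

    products[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();

    // each pass halves the number of active threads
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            products[threadIdx.x] += products[threadIdx.x + stride];
        __syncthreads();   // a barrier every pass; no volatile required
    }

    if (threadIdx.x == 0)
        *result = products[0];
}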
