When to use volatile with shared CUDA Memory

Question
Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:
__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}
Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array, so that I need it to be volatile, or will it only cache individual elements?
Answer
If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by locating them in registers (whose scope is specific to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile.
Obviously, if each thread only ever accesses its own elements of shared memory, and never those associated with another thread, then this does not matter and the compiler optimization will not break anything.
In your case, where each thread accesses its own elements of shared memory and the only inter-thread access occurs at a well-understood location, you could use a memory fence function to force the compiler to evict any values temporarily stored in registers back out to the shared array. So you might think that __threadfence_block() would be useful here, but in your case __syncthreads() already has memory-fencing functionality built in. Your __syncthreads() call is therefore sufficient both to synchronize the threads and to force any register-cached copies of shared memory values to be written back to shared memory.
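To make that concrete, here is the question's pattern fleshed out into a complete kernel; the input/output parameters and the use of `in[threadIdx.x]` as a stand-in for "some computation" are assumptions for illustration:

```cuda
#define THREADS_PER_ACTION 256   // assumed block size from the question

__global__ void sumKernel(const float *in, float *out)
{
    // No volatile required: the __syncthreads() below acts as a memory
    // fence, so products[] is fully written back to shared memory
    // before thread 0 reads it.
    __shared__ float products[THREADS_PER_ACTION];

    float localSum = in[threadIdx.x];   // stand-in for "some computation"
    products[threadIdx.x] = localSum;

    __syncthreads();   // barrier + fence: all writes visible after this

    if (threadIdx.x == 0) {
        float globalSum = 0.0f;
        for (int i = 0; i < THREADS_PER_ACTION; i++)
            globalSum += products[i];
        *out = globalSum;
    }
}
```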
By the way, if that reduction at the end of your code is a performance concern, you could consider using a parallel reduction method to speed it up.
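A minimal tree-style parallel reduction looks roughly like this; it is a sketch assuming a power-of-two block size (the CUDA SDK's reduction sample shows tuned variants):

```cuda
#define THREADS_PER_ACTION 256   // assumed power-of-two block size

__global__ void reduceKernel(const float *in, float *out)
{
    __shared__ float products[THREADS_PER_ACTION];

    products[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // Halve the number of active threads each step: O(log n) steps
    // instead of thread 0 serially summing all n elements.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            products[threadIdx.x] += products[threadIdx.x + stride];
        __syncthreads();   // barrier + fence each step, so no volatile needed
    }

    if (threadIdx.x == 0)
        *out = products[0];
}
```

Because every step ends with __syncthreads(), this version also needs no volatile qualifier.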