When to use volatile with shared CUDA memory


Problem description

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:

__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}

Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Could the compiler cache the entire array, so that I need it to be volatile, or will it only cache individual elements?

Recommended answer

If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by locating them in registers (whose scope is specific to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile.
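
The classic situation where this matters is warp-synchronous code that omits __syncthreads() between steps. Below is a minimal sketch (not from the original answer; the kernel name, the 32-thread block size, and the pre-Volta lockstep assumption are illustrative) showing volatile used so that each cross-thread read really goes to shared memory rather than a stale register copy:

__global__ void warpReduceSketch(const float *in, float *out)
{
    // volatile: every read/write of sdata is a real shared-memory access
    volatile __shared__ float sdata[32];

    unsigned tid = threadIdx.x;              // sketch assumes blockDim.x == 32
    sdata[tid] = in[blockIdx.x * 32 + tid];

    // No __syncthreads() between steps: this relies on warp-lockstep execution
    // (a pre-Volta assumption; on newer GPUs __syncwarp() is also needed).
    if (tid < 16) sdata[tid] += sdata[tid + 16];
    if (tid <  8) sdata[tid] += sdata[tid +  8];
    if (tid <  4) sdata[tid] += sdata[tid +  4];
    if (tid <  2) sdata[tid] += sdata[tid +  2];
    if (tid <  1) sdata[tid] += sdata[tid +  1];

    if (tid == 0) out[blockIdx.x] = sdata[0];
}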

Obviously, if each thread only ever accesses its own elements of shared memory, and never those associated with another thread, then this does not matter and the compiler optimization will not break anything.

In your case, where each thread accesses its own elements of shared memory and the only inter-thread access occurs at a well-understood point, you could use a memory fence function to force the compiler to evict any values that are temporarily stored in registers back out to the shared array. So you might think __threadfence_block() would be useful, but in your case __syncthreads() already has memory-fencing functionality built in. Your __syncthreads() call is therefore sufficient both to force thread synchronization and to force any register-cached shared-memory values to be evicted back to shared memory.
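
As an illustration (a sketch reusing the names from the question; the comments restate the reasoning above), the two functions compare as follows:

__shared__ float products[THREADS_PER_ACTION];

products[threadIdx.x] = localSum;   // each thread writes only its own slot

// __threadfence_block();           // fence only: makes the writes visible within
                                    // the block, but does not make thread 0 wait

__syncthreads();                    // barrier + implicit block-level fence:
                                    // this is why volatile is not needed here

if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];   // sees every other thread's write
}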

By the way, if the reduction at the end of your code is a performance concern, you could consider using a parallel reduction method to speed it up.
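
For reference, here is a minimal sketch of such a parallel reduction (assuming THREADS_PER_ACTION is a power of two and equal to blockDim.x), reusing the names from the question. Each level halves the number of active threads, and the __syncthreads() after every level again provides both the barrier and the fence, so the array still does not need to be volatile:

__shared__ float products[THREADS_PER_ACTION];

products[threadIdx.x] = localSum;
__syncthreads();

// Tree reduction: log2(THREADS_PER_ACTION) levels instead of a serial loop.
for (unsigned stride = THREADS_PER_ACTION / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
        products[threadIdx.x] += products[threadIdx.x + stride];
    __syncthreads();                // barrier + fence after every level
}

if (threadIdx.x == 0) {
    float globalSum = products[0];  // block-wide sum ends up in element 0
    // ... use globalSum ...
}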
