Weak guarantees for non-atomic writes on GPUs?


Question

OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports them). My question, though, is about the possibility of "living with" races due to non-atomic writes.

Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?

Relevant parameters for this question (choose any combination(s); edit: except nVIDIA+CUDA, which already got an answer):

  • Memory space: Global memory only; this question is not about local/shared/private memory.
  • Alignment: Within a single memory write width (e.g. 128 bits on nVIDIA GPUs)
  • GPU Manufacturer: AMD / nVIDIA
  • Programming framework: CUDA / OpenCL
  • Position of store instruction in code: Same line of code for all threads / different lines of code.
  • Write destination: Fixed address / fixed offset from the address of a function parameter / completely dynamic
  • Write width: 8 / 32 / 64 bits.

Answer

Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?

For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.

When multiple "simultaneous" writes occur to the same location, one of them will win, intact.

Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur but be replaced by the winner, or they may not occur at all). The actual contents of the memory location may transit through various values (such as the original value, plus any of the valid written values), but the transit will not pass through "junk" values (i.e. values that were not already there and were not written by any thread). The transit ends up at the "winner", eventually.

Example 1:

Location X contains zero. Threads 1, 5, 32, 30000, and 450000 all write the value one to that location. If there is no other write traffic to that location, that location will eventually contain the value one (at kernel termination, or earlier).

Example 2:

Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.
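Example 2 can be sketched as a small CUDA program. This is only an illustration under the assumptions stated in this answer (a naturally aligned 4-byte location in global memory, one plain store per thread); the names `racy_write` and `run_example2` are made up for the sketch, and a passing run demonstrates the behavior on one device rather than proving it in general.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Three specific threads store different values to the same global
// location with plain (non-atomic) 32-bit writes. All other threads idle.
__global__ void racy_write(int *x)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 32)     *x = 1;
    if (tid == 90303)  *x = 7;
    if (tid == 432322) *x = 972;
}

// Runs the race once and returns the value left in X, or -1 on a CUDA error.
int run_example2()
{
    const int threads = 256;
    const int max_tid = 432322;
    const int blocks = (max_tid + threads) / threads;  // covers thread index max_tid

    int *d_x = nullptr;
    int h_x = 5;  // X initially contains 5
    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);

    racy_write<<<blocks, threads>>>(d_x);
    cudaError_t err = cudaDeviceSynchronize();  // wait for kernel termination

    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    return (err == cudaSuccess) ? h_x : -1;
}

int main()
{
    for (int run = 0; run < 5; ++run) {
        int v = run_example2();
        bool ok = (v == 1 || v == 7 || v == 972);
        printf("run %d: X = %d (%s)\n", run, v, ok ? "a written value" : "unexpected");
        if (!ok) return 1;
    }
    return 0;
}
```

Which of the three values appears may differ between runs or devices; the check only requires that it is one of them and not the original 5 or anything else. The file can be built with `nvcc`, for example `nvcc race.cu -o race`.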

I'm assuming X is in global memory, that all traffic to it is naturally aligned, and that all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those that originate from a single SASS instruction (per thread). Such transactions can have a width of 1, 2, 4, or 8 bytes. The claims I've made here apply whether the writes originate from "the same line of code" or "different lines".
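The "single SASS instruction" scope matters once a logical write is wider than one store. The sketch below is an assumption-laden illustration, not something the claims above cover directly: it uses a hypothetical 4-byte-aligned struct `Pair`, which the compiler will most likely write as two separate 32-bit stores. Each field should then individually hold a value that some thread wrote, but nothing here promises that both fields come from the same thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical two-field record. With 4-byte alignment, assigning a whole
// Pair is expected (not guaranteed) to compile to two 32-bit stores.
struct Pair { int a; int b; };

// Every thread writes its own pair (k, -k) to the same location, k = 1..N.
__global__ void racy_pair_write(Pair *p)
{
    int k = (int)(blockIdx.x * blockDim.x + threadIdx.x) + 1;
    Pair v;
    v.a = k;
    v.b = -k;
    *p = v;  // plain, non-atomic write of both fields
}

// Launches blocks*threads racing writers and copies the final pair to *out.
// Returns false on a CUDA error.
bool run_pair_race(int blocks, int threads, Pair *out)
{
    Pair *d_p = nullptr;
    Pair init = {0, 0};
    if (cudaMalloc(&d_p, sizeof(Pair)) != cudaSuccess) return false;
    cudaMemcpy(d_p, &init, sizeof(Pair), cudaMemcpyHostToDevice);

    racy_pair_write<<<blocks, threads>>>(d_p);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(out, d_p, sizeof(Pair), cudaMemcpyDeviceToHost);
    cudaFree(d_p);
    return err == cudaSuccess;
}

int main()
{
    const int blocks = 1024, threads = 256, n = blocks * threads;
    Pair r;
    if (!run_pair_race(blocks, threads, &r)) {
        printf("CUDA error\n");
        return 1;
    }
    // Per-field check: each field is a value that some thread wrote.
    bool a_ok = (r.a >= 1 && r.a <= n);
    bool b_ok = (r.b <= -1 && r.b >= -n);
    printf("a = %d, b = %d, fields valid: %s, same writer: %s\n",
           r.a, r.b, (a_ok && b_ok) ? "yes" : "no",
           (r.a == -r.b) ? "yes" : "no");
    return (a_ok && b_ok) ? 0 : 1;
}
```

The program only treats an out-of-range field as a failure. A result where `a` and `b` come from different threads is reported but not treated as an error, since the per-instruction guarantee discussed above says nothing about a write that spans several stores.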

These claims are based on the PTX memory consistency model, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.

This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.
