GLSL: about coherent qualifier

Problem description

I don't clearly understand how the coherent qualifier and atomic operations work together.

I perform an accumulating operation on the same SSBO location with this code:

uint prevValue, newValue;
uint readValue = ssbo[index];
do
{
    prevValue = readValue;
    newValue = F(readValue);
}
while((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);

This code works fine for me, but do I still need to declare the SSBO (or image) with the coherent qualifier in this case?
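For readers more used to CPU atomics, the loop above has the shape of a classic compare-and-swap retry loop. The following is a minimal Python sketch of that shape only. `AtomicCell`, `comp_swap`, `accumulate` and `run` are hypothetical names, the lock merely emulates the atomicity of atomicCompSwap, and nothing here models GPU caches or visibility, which is what coherent is actually about.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for one uint stored in an SSBO."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def comp_swap(self, compare, data):
        # Same contract as atomicCompSwap: store `data` only if the current
        # value equals `compare`, and always return the original value.
        with self._lock:
            original = self._value
            if original == compare:
                self._value = data
            return original


def accumulate(cell, f):
    """Apply f to the cell, retrying until no other thread interfered."""
    read_value = cell.load()
    while True:
        prev_value = read_value
        new_value = f(read_value)
        read_value = cell.comp_swap(prev_value, new_value)
        if read_value == prev_value:
            return new_value


def run(threads=8, iterations=1000):
    """Increment one shared cell from several threads and return the total."""
    cell = AtomicCell(0)

    def worker():
        for _ in range(iterations):
            accumulate(cell, lambda v: (v + 1) & 0xFFFFFFFF)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()


if __name__ == "__main__":
    print(run())
```

Because every failed swap returns the value that was actually in memory, the loop retries with fresh data and no increment is lost, whatever F is.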

And do I need coherent when I only call atomicAdd?

When exactly do I need the coherent qualifier? Do I need it only when writing directly, as in ssbo[index] = value;?

Recommended answer

TL;DR

I found evidence supporting both answers regarding coherent.

Current score:

  • Requiring coherent with atomics: 1.5
  • Omitting coherent with atomics: 5.75

Bottom line: I'm still not sure, despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm less sure in these cases:

  1. more than one workgroup in glDispatchCompute
  2. multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them

However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on the evidence below, I don't believe there is, because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should hopefully:

  1. ignore coherent when generating the atomic instructions, because it has no effect
  2. use the appropriate mechanism to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.

From the OpenGL wiki's "Memory Model" page:

Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like. (removed on 2020-04-12)

However, if memory has been modified in an incoherent fashion, any subsequent reads from that memory are not automatically guaranteed to see these changes.

+1 for requiring coherent

// Fragment shader used for ACB gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;

void main()
{
    for (int i=0; i<  s(nCounters) ; ++i) atomicCounterIncrement(acb[i]);
    fragColor = texture(texUnit, texcoord);
}

// Fragment shader used for SSBO gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
    uint v[ s(nCounters) ];
};

void main()
{
    for (int i=0; i< s(nCounters) ; ++i) atomicAdd(v[i], 1);
    fragColor = texture(texUnit, texcoord);
}

Notice that ssbo_data in the second shader is not declared coherent.

The article also states:

The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.

So atomic counters are actually the same thing as SSBOs apparently. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)

+1 for omitting coherent

The GLSL spec uses different wording when describing coherent and atomic operations (emphasis mine):

(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.
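The caching behaviour this paragraph permits can be pictured with a toy model. The sketch below is an illustration under loud assumptions, not a description of any real GPU: `Memory` and `Invocation` are hypothetical names, each invocation gets a private cache, and a coherent access simply goes straight to the shared memory.

```python
class Memory:
    """Hypothetical shared buffer storage, addressed by integer index."""

    def __init__(self):
        self.data = {}


class Invocation:
    """Hypothetical shader invocation with its own private cache."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def load(self, addr, coherent=False):
        if coherent:
            # Coherent read: fetch from shared memory and refresh the cache.
            self.cache[addr] = self.memory.data.get(addr, 0)
            return self.cache[addr]
        # Non-coherent read: may be served from a stale cached copy.
        if addr not in self.cache:
            self.cache[addr] = self.memory.data.get(addr, 0)
        return self.cache[addr]

    def store(self, addr, value, coherent=False):
        self.cache[addr] = value
        if coherent:
            # Coherent write: the value reaches shared memory immediately.
            self.memory.data[addr] = value


if __name__ == "__main__":
    mem = Memory()
    mem.data[0] = 1
    a, b = Invocation(mem), Invocation(mem)
    print(b.load(0))                 # b caches the initial value
    a.store(0, 7, coherent=True)     # a publishes a new value
    print(b.load(0))                 # b may still see its cached copy
    print(b.load(0, coherent=True))  # a coherent read sees a's write
```

In this model a non-coherent store never reaches shared memory at all. Real implementations eventually write caches back, so the model exaggerates the effect to make the allowed staleness visible.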

(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.

All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.

So on the one hand atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.

+0.5 for requiring coherent

The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1, "Shader Memory Access Ordering":

The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.

The intent of atomic operations then clearly seems to be, well, atomic all the time and not depending on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write a completely independent value makes no sense.

+1 for omitting coherent

OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?

We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.

The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute.

Let's inspect the assembly produced by two simple shaders to see what happens in practice.

#version 460
layout(local_size_x = 512) in;

// Non-coherent qualified SSBO
layout(binding=0) restrict buffer Buf { uint count; } buf;

// Coherent qualified SSBO
layout(binding=1) coherent restrict buffer Buf_coherent { uint count; } buf_coherent;

void main()
{
  // First shader with atomics (v1)
  uint read_value1 = atomicAdd(buf.count, 2);
  uint read_value2 = atomicAdd(buf_coherent.count, 4);

  // Second shader with non-atomic add (v2)
  buf.count += 2;
  buf_coherent.count += 4;
}

The second shader is used to compare the effects of the coherent qualifier between atomic operations and non-atomic operations.

AMD publishes Instruction Set Architecture (ISA) documents which, coupled with the Radeon GPU Analyzer, give insight into how GPUs actually implement this.

s_getpc_b64           s[0:1]                   BE801C80
s_mov_b32             s0, s2                   BE800002
s_mov_b64             s[2:3], exec             BE82017E
s_ff1_i32_b64         s4, exec                 BE84117E
s_lshl_b64            s[4:5], 1, s4            8E840481
s_and_b64             s[4:5], s[4:5], exec     86847E04
s_and_saveexec_b64    s[4:5], s[4:5]           BE842004
s_cbranch_execz       label_0010               BF880008
s_load_dwordx4        s[8:11], s[0:1], 0x00    C00A0200 00000000
s_bcnt1_i32_b64       s2, s[2:3]               BE820D02
s_mulk_i32            s2, 0x0002               B7820002
v_mov_b32             v0, s2                   7E000202
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000
label_0010:
s_mov_b64             exec, s[4:5]             BEFE0104
s_mov_b64             s[2:3], exec             BE82017E
s_ff1_i32_b64         s4, exec                 BE84117E
s_lshl_b64            s[4:5], 1, s4            8E840481
s_and_b64             s[4:5], s[4:5], exec     86847E04
s_and_saveexec_b64    s[4:5], s[4:5]           BE842004
s_cbranch_execz       label_001F               BF880008
s_load_dwordx4        s[8:11], s[0:1], 0x20    C00A0200 00000020
s_bcnt1_i32_b64       s0, s[2:3]               BE800D02
s_mulk_i32            s0, 0x0004               B7800004
v_mov_b32             v0, s0                   7E000200
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000
label_001F:
s_endpgm                                       BF810000

(Don't know why the exec mask and branching is used here...)

We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:

buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000

Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0 which means for atomic operations: "Previous data value is not returned. No L1 persistence across wavefronts". Modifying the shader to use the returned values changes the GLC flag of both atomic instructions to 1 which means: "Previous data value is returned. No L1 persistence across wavefronts".

The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:

Buffer object atomic operation. Always globally coherent.

So on AMD hardware, it appears coherent has no effect for atomic operations.

s_getpc_b64           s[0:1]                   BE801C80
s_mov_b32             s0, s2                   BE800002
s_load_dwordx4        s[4:7], s[0:1], 0x00     C00A0100 00000000
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_load_dword     v0, v0, s[4:7], 0        E0500000 80010000
s_load_dwordx4        s[0:3], s[0:1], 0x20     C00A0000 00000020
s_waitcnt             vmcnt(0)                 BF8C0F70
v_add_u32             v0, 2, v0                68000082
buffer_store_dword    v0, v0, s[4:7], 0 glc    E0704000 80010000
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_load_dword     v0, v0, s[0:3], 0 glc    E0504000 80000000
s_waitcnt             vmcnt(0)                 BF8C0F70
v_add_u32             v0, 4, v0                68000084
buffer_store_dword    v0, v0, s[0:3], 0 glc    E0704000 80000000
s_endpgm                                       BF810000

The buffer_load_dword operation on the coherent buffer uses the glc flag and the one on the non-coherent buffer does not, as expected.
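The glc observations in the two AMD listings can be cross-checked against the hex encodings printed next to each instruction. The sketch below assumes the GCN3/Vega MUBUF layout, in which bits 26 to 31 of the first dword hold the encoding 0b111000 and bit 14 is GLC. That layout is an assumption taken from AMD's ISA documents rather than from the listings themselves, and `decode_mubuf_glc` is a hypothetical helper name.

```python
# Assumed GCN3/Vega MUBUF layout for the first instruction dword:
#   bits 26..31 = 0b111000 (MUBUF encoding), bit 14 = GLC.
MUBUF_ENCODING = 0b111000
GLC_BIT = 14


def decode_mubuf_glc(first_dword):
    """Return True if the GLC bit is set in a MUBUF instruction dword."""
    if (first_dword >> 26) & 0x3F != MUBUF_ENCODING:
        raise ValueError("not a MUBUF instruction: 0x%08X" % first_dword)
    return bool((first_dword >> GLC_BIT) & 1)


# First dwords copied from the two listings above.
LISTING = [
    ("v1 buffer_atomic_add (both buffers)", 0xE1080000),
    ("v2 buffer_load_dword, non-coherent", 0xE0500000),
    ("v2 buffer_store_dword, non-coherent", 0xE0704000),
    ("v2 buffer_load_dword, coherent", 0xE0504000),
    ("v2 buffer_store_dword, coherent", 0xE0704000),
]

if __name__ == "__main__":
    for name, dword in LISTING:
        print("%-40s glc=%s" % (name, decode_mubuf_glc(dword)))
```

Under that assumed layout, the decoded bits agree with the glc suffixes printed by the disassembler, including the glc on the store to the non-coherent buffer.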

On AMD: +1 for omitting coherent

It's possible to get the assembly of a shader by inspecting the blob returned by glGetProgramBinary(). The instructions are described in NV_gpu_program4, NV_gpu_program5 and NV_gpu_program5_mem_extended.

!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
ATOMB.ADD.U32 R0.x, {2, 0, 0, 0}, sbo_buf0[0];
ATOMB.ADD.U32 R0.x, {4, 0, 0, 0}, sbo_buf1[0];
END

Whether coherent is present or not makes no difference.

!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
LDB.U32 R0.x, sbo_buf0[0];
ADD.U R0.x, R0, {2, 0, 0, 0};
STB.U32 R0, sbo_buf0[0];
LDB.U32.COH R0.x, sbo_buf1[0];
ADD.U R0.x, R0, {4, 0, 0, 0};
STB.U32 R0, sbo_buf1[0];
END

The LDB.U32 operation on the coherent buffer uses the COH modifier which means "Make LOAD and STORE operations use coherent caching".

On NVIDIA: +1 for omitting coherent

Let's see what SPIR-V code is generated by the glslang SPIR-V generator.

// Generated with glslangValidator.exe -H --target-env vulkan1.1
// Module Version 10300
// Generated by (magic number): 80008
// Id's are bound by 30

                              Capability Shader
               1:             ExtInstImport  "GLSL.std.450"
                              MemoryModel Logical GLSL450
                              EntryPoint GLCompute 4  "main"
                              ExecutionMode 4 LocalSize 512 1 1
                              Source GLSL 460
                              Name 4  "main"
                              Name 8  "read_value1"
                              Name 9  "Buf"
                              MemberName 9(Buf) 0  "count"
                              Name 11  "buf"
                              Name 20  "read_value2"
                              Name 21  "Buf_coherent"
                              MemberName 21(Buf_coherent) 0  "count"
                              Name 23  "buf_coherent"
                              MemberDecorate 9(Buf) 0 Restrict
                              MemberDecorate 9(Buf) 0 Offset 0
                              Decorate 9(Buf) Block
                              Decorate 11(buf) DescriptorSet 0
                              Decorate 11(buf) Binding 0
                              MemberDecorate 21(Buf_coherent) 0 Coherent
                              MemberDecorate 21(Buf_coherent) 0 Restrict
                              MemberDecorate 21(Buf_coherent) 0 Offset 0
                              Decorate 21(Buf_coherent) Block
                              Decorate 23(buf_coherent) DescriptorSet 0
                              Decorate 23(buf_coherent) Binding 1
                              Decorate 29 BuiltIn WorkgroupSize
               2:             TypeVoid
               3:             TypeFunction 2
               6:             TypeInt 32 0
               7:             TypePointer Function 6(int)
          9(Buf):             TypeStruct 6(int)
              10:             TypePointer StorageBuffer 9(Buf)
         11(buf):     10(ptr) Variable StorageBuffer
              12:             TypeInt 32 1
              13:     12(int) Constant 0
              14:             TypePointer StorageBuffer 6(int)
              16:      6(int) Constant 2
              17:      6(int) Constant 1
              18:      6(int) Constant 0
21(Buf_coherent):             TypeStruct 6(int)
              22:             TypePointer StorageBuffer 21(Buf_coherent)
23(buf_coherent):     22(ptr) Variable StorageBuffer
              25:      6(int) Constant 4
              27:             TypeVector 6(int) 3
              28:      6(int) Constant 512
              29:   27(ivec3) ConstantComposite 28 17 17
         4(main):           2 Function None 3
               5:             Label
  8(read_value1):      7(ptr) Variable Function
 20(read_value2):      7(ptr) Variable Function
              15:     14(ptr) AccessChain 11(buf) 13
              19:      6(int) AtomicIAdd 15 17 18 16
                              Store 8(read_value1) 19
              24:     14(ptr) AccessChain 23(buf_coherent) 13
              26:      6(int) AtomicIAdd 24 17 18 25
                              Store 20(read_value2) 26
                              Return
                              FunctionEnd

The only difference between buf and buf_coherent is the decoration of the latter with MemberDecorate 21(Buf_coherent) 0 Coherent. Their usage afterwards is identical.
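It can also help to read the operands of the two AtomicIAdd instructions in the listing. In SPIR-V, OpAtomicIAdd takes a pointer, a scope id, a memory-semantics id and a value, and the Scope enumeration defines 0 as CrossDevice, 1 as Device, 2 as Workgroup, 3 as Subgroup and 4 as Invocation. The sketch below hard-codes the ids and constants from the listing; `describe` is a hypothetical helper, not part of any SPIR-V tool.

```python
# SPIR-V Scope enumeration values.
SCOPES = {0: "CrossDevice", 1: "Device", 2: "Workgroup",
          3: "Subgroup", 4: "Invocation"}

# Integer constants from the listing: result id -> value.
CONSTANTS = {16: 2, 17: 1, 18: 0, 25: 4}

# AtomicIAdd instructions from the listing:
# result id -> (pointer id, scope id, semantics id, value id).
ATOMICS = {
    19: (15, 17, 18, 16),  # atomicAdd(buf.count, 2)
    26: (24, 17, 18, 25),  # atomicAdd(buf_coherent.count, 4)
}


def describe(result_id):
    """Resolve the scope, semantics mask and addend of one AtomicIAdd."""
    _pointer, scope_id, semantics_id, value_id = ATOMICS[result_id]
    return {
        "scope": SCOPES[CONSTANTS[scope_id]],
        "semantics": CONSTANTS[semantics_id],
        "value": CONSTANTS[value_id],
    }


if __name__ == "__main__":
    for rid in sorted(ATOMICS):
        print(rid, describe(rid))
```

Both atomics resolve to Device scope with a memory-semantics mask of 0, so glslang emits the same scope and semantics whether or not the block member is decorated Coherent.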

Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:

                              Capability Shader
+                             Capability VulkanMemoryModelKHR
+                             Extension  "SPV_KHR_vulkan_memory_model"
               1:             ExtInstImport  "GLSL.std.450"
-                             MemoryModel Logical GLSL450
+                             MemoryModel Logical VulkanKHR
                              EntryPoint GLCompute 4  "main"

                              Decorate 11(buf) Binding 0
-                             MemberDecorate 21(Buf_coherent) 0 Coherent
                              MemberDecorate 21(Buf_coherent) 0 Restrict

which means... I don't quite know, because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:

While GLSL (and legacy SPIR-V) applies the "coherent" decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.

Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.

Shader v2

(skipping full output)

The only difference between buf and buf_coherent is again MemberDecorate 18(Buf_coherent) 0 Coherent.

Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:

-                             MemberDecorate 18(Buf_coherent) 0 Coherent

-             23:      6(int) Load 22
-             24:      6(int) IAdd 23 21
-             25:     13(ptr) AccessChain 20(buf_coherent) 11
-                             Store 25 24
+             23:      6(int) Load 22 MakePointerVisibleKHR NonPrivatePointerKHR 24
+             25:      6(int) IAdd 23 21
+             26:     13(ptr) AccessChain 20(buf_coherent) 11
+                             Store 26 25 MakePointerAvailableKHR NonPrivatePointerKHR 24

Notice the addition of MakePointerVisibleKHR and MakePointerAvailableKHR that control operation coherency at the instruction level instead of the variable level.

+1 for omitting coherent (maybe?)

The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:

8.5. Scope

Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:

Table 18. Scopes

  • .cta: The set of all threads executing in the same CTA as the current thread.
  • .gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
  • .sys: The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.

Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.

About CTAs:

A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.

So in GLSL terms, CTA == work group and grid == glDispatchCompute call.

9.7.12.4. Parallel Synchronization and Communication Instructions: atom

Atomic reduction operations for thread-to-thread communication.

[...]

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.

[...]

If no scope is specified, the atomic operation is performed with .gpu scope.

So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope in which case it would only be visible inside the workgroup. This latter case however corresponds to shared GLSL variables so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and gpu to buffers (i.e. SSBO, images, etc), I declare:

+0.5 for omitting coherent

I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd(), and it works. Using coherent was not necessary even with a work group size of 512. However, there was never more than one work group. This was tested mainly on an NVIDIA GTX 1080, so, as seen above, atomic operations on NVIDIA seem to always be visible at least inside the work group.

+0.25 for omitting coherent
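As a sanity check, the individual scores awarded above can be added up to see that they match the totals in the TL;DR. The evidence labels in this sketch are shorthand chosen for illustration and are not headings from the answer.

```python
# Scores as awarded in the answer; the labels are informal shorthand.
REQUIRING = {
    "OpenGL wiki": 1.0,
    "GLSL spec": 0.5,
}
OMITTING = {
    "Intel article": 1.0,
    "OpenGL 4.6 spec": 1.0,
    "AMD assembly": 1.0,
    "NVIDIA assembly": 1.0,
    "SPIR-V (maybe)": 1.0,
    "CUDA PTX scopes": 0.5,
    "Personal experience": 0.25,
}


def total(scores):
    """Sum the scores of one side of the argument."""
    return sum(scores.values())


if __name__ == "__main__":
    print("Requiring coherent with atomics:", total(REQUIRING))
    print("Omitting coherent with atomics:", total(OMITTING))
```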
