Why does barrier synchronize shared memory when memoryBarrier doesn't?


Question

The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.

In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.

Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:

(Image: striped output.)

which is the same result as having no synchronization or using memoryBarrier() instead.

If I use barrier(), I get the following (desired) result:

(Image: correct copy of inImage.)

The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.

What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?

#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly  image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x,0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i,0));
        }
    }

    // with no synchronization:   stripes
    // memoryBarrier();        // stripes
    // memoryBarrierShared();  // stripes
    // barrier();              // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}

Answer

The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which have no dedicated output and only produce results by writing data into writable stores, like images, storage buffers, or atomic counters. This may require manual synchronization between individual passes, as otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.

So it may be that your compute shader works perfectly, and it is the synchronization with the following display (or whatever) pass (which needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or more precisely, the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if you use glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
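As an illustration, the host-side ordering could look roughly like this (pseudocode sketch; the dispatch dimensions and the `drawFullscreenQuadSampling` display pass are assumptions, not code from the question):

```
// Pseudocode: host-side ordering between the compute pass and a
// later pass that samples the written image as a texture.
glDispatchCompute(width / 64, height, 1);    // run the copy shader (local_size_x = 64)

// Make the image stores from the compute pass visible to
// subsequent texture fetches.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

drawFullscreenQuadSampling(outImage);        // hypothetical display pass
```

If the reading pass uses image loads instead of texture fetches, GL_SHADER_IMAGE_ACCESS_BARRIER_BIT would be the appropriate bit.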

Though I don't have much experience with image load/store and manual memory synchronization, this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:

  • Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.

  • Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.


EDIT: Actually the Wiki article on compute shaders says:

Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared coherent, so you don't need to (and can't) use that qualifier. However, you still need to provide an appropriate memory barrier.

The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders reads/writes for the current work group.

While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.

So this actually sounds like you need the barrier there, and memoryBarrierShared is not enough (though you don't need both, as the last sentence says). The memory barrier will just synchronize the memory, but it doesn't stop threads from executing past it. Thus the threads won't read any old cached data from the shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.

This actually fits perfectly with the fact that it works for block sizes of 32 and below, and that the first 32 pixels work. At least on NVIDIA hardware, 32 is the warp size and thus the number of threads that operate in perfect lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (well, conceptually, that is) and thus cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, a common optimization.
