In OpenCL, what does mem_fence() do, as opposed to barrier()?

Question

Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work group. The OpenCL spec says (section 6.11.10), for mem_fence():

Orders loads and stores of a work-item executing a kernel.

(so it applies to a single work item).

But, at the same time, in section 3.3.1, it says that:

Within a work-item memory has load / store consistency.

So within a work-item, memory is consistent.

So what kind of thing is mem_fence() useful for? It doesn't work across items, yet isn't needed within an item...

Note that I haven't used atomic operations (section 9.5 etc). Is the idea that mem_fence() is used in conjunction with those? If so, I'd love to see an example.

Thanks.

The spec, for reference.

Update: I can see how it is useful when used with barrier() (implicitly, since the barrier calls mem_fence()) - but surely there must be more, since it exists separately?

Answer

To try to put it more clearly (hopefully),

mem_fence() waits until all reads/writes to local and/or global memory made by the calling work-item prior to mem_fence() are visible to all threads in the work-group.

From: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf

Memory operations can be reordered to suit the device they run on. The spec states (basically) that any such reordering must leave memory in a consistent state within a single work-item. But suppose, for example, that you perform a store and the value stays in a work-item-specific cache for now, until a better time presents itself to write it through to local/global memory. If you load from that address, the work-item that wrote the value finds it in its cache, so no problem. Other work-items in the work-group don't have it, though, so they may read a stale value. Placing a memory fence ensures that, at the point of the fence call, local/global memory (whichever the flags argument selects) is made consistent: any caches are flushed, and any reordering takes into account that other threads may need to access this data after that point.
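One way to make the caching/reordering point concrete is a CPU-side analogy in C11. This is not OpenCL code, and a GPU's memory model differs in detail. Two threads stand in for two work-items, and atomic_thread_fence() plays the role that mem_fence() plays in a kernel. The names payload, ready, producer, consumer and publish_and_read are made up for this sketch, and it assumes a C11 compiler with pthread support (for example, building with -pthread).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical shared state for this sketch. */
static int payload;        /* ordinary, non-atomic data            */
static atomic_int ready;   /* flag the consumer polls (0 = not yet) */

/* Stand-in for the work-item that writes the data. */
static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* 1. ordinary store */
    /* 2. fence: the store above is ordered before the flag store below */
    atomic_thread_fence(memory_order_release);
    /* 3. publish: tell the other thread the data is there */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

/* Stand-in for another work-item that reads the data. */
static void *consumer(void *arg)
{
    int *out = arg;
    /* spin until the flag has been set */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;
    /* fence: the load of payload below is ordered after the flag load */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return NULL;
}

/* Runs one producer and one consumer; returns what the consumer read. */
int publish_and_read(void)
{
    pthread_t p, c;
    int seen = -1;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&c, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

With the two fences in place, publish_and_read() returns 42. Without them, the C11 rules no longer guarantee that the consumer sees the new payload just because it saw the flag, which is the same kind of problem described above for work-items.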

I admit it is still confusing, and I won't swear that my understanding is 100% correct, but I think it is at least the general idea.

Follow-up:

I found this link which talks about CUDA memory fences, but the same general idea applies to OpenCL:

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

See section B.5, Memory Fence Functions.

They have a code example that computes the sum of an array of numbers in one call. The code is set up to compute a partial sum in each work-group. Then, if there is more summing to do, the code has the last work-group do the work.

So each work-group basically does two things: it computes a partial sum and writes it to a global variable, then it atomically increments a global counter variable.

After that, if there is more work left to do, the work-group whose atomic increment returns ("number of work-groups" - 1), meaning every other work-group has already incremented the counter, is taken to be the last work-group. That work-group goes on to finish up.

Now, the problem (as they explain it) is that, because of memory re-ordering and/or caching, the counter may get incremented and the last work-group may begin to do its work before that partial sum global variable has had its most recent value written to global memory.

A memory fence will ensure that the value of that partial sum variable is consistent for all threads before moving past the fence.
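The shape of that pattern can be sketched as a CPU-side analogy in C11. This is not the guide's CUDA code and not an OpenCL kernel. Each thread stands in for one work-group, atomic_fetch_add stands in for the atomic counter increment, and atomic_thread_fence() stands in for the memory fence. NUM_GROUPS, ITEMS_PER_GROUP, group_args, group_main and sum_with_last_group are hypothetical names chosen for illustration, and the sketch assumes a C11 compiler with pthread support.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_GROUPS 4          /* hypothetical number of "work-groups" */
#define ITEMS_PER_GROUP 256   /* hypothetical elements per group      */

typedef struct {
    const int *input;       /* whole input array                       */
    long *partial;          /* one partial sum per group               */
    atomic_uint *counter;   /* how many groups have finished so far    */
    long *total;            /* written only by the last group to finish */
    unsigned group_id;
} group_args;

static void *group_main(void *p)
{
    group_args *a = p;
    long sum = 0;

    /* Step 1: partial sum over this group's slice. */
    for (size_t i = 0; i < ITEMS_PER_GROUP; ++i)
        sum += a->input[(size_t)a->group_id * ITEMS_PER_GROUP + i];
    a->partial[a->group_id] = sum;            /* ordinary store */

    /* Fence: the partial-sum store is ordered before the increment. */
    atomic_thread_fence(memory_order_release);

    /* Step 2: atomically increment the counter; ticket is the old value. */
    unsigned ticket =
        atomic_fetch_add_explicit(a->counter, 1u, memory_order_relaxed);

    /* The group that sees NUM_GROUPS - 1 knows all others are done. */
    if (ticket == NUM_GROUPS - 1u) {
        /* Fence: the reads of partial[] below come after the increment. */
        atomic_thread_fence(memory_order_acquire);
        long t = 0;
        for (unsigned g = 0; g < NUM_GROUPS; ++g)
            t += a->partial[g];
        *a->total = t;
    }
    return NULL;
}

/* Sums NUM_GROUPS * ITEMS_PER_GROUP ints; the last group to finish
   combines the partial sums. */
long sum_with_last_group(const int *input)
{
    pthread_t threads[NUM_GROUPS];
    group_args args[NUM_GROUPS];
    long partial[NUM_GROUPS] = {0};
    atomic_uint counter;
    long total = 0;

    atomic_init(&counter, 0u);
    for (unsigned g = 0; g < NUM_GROUPS; ++g) {
        args[g] = (group_args){ input, partial, &counter, &total, g };
        int rc = pthread_create(&threads[g], NULL, group_main, &args[g]);
        assert(rc == 0);
        (void)rc;
    }
    for (unsigned g = 0; g < NUM_GROUPS; ++g)
        pthread_join(threads[g], NULL);
    return total;
}
```

For an input array holding 0, 1, ..., 1023, sum_with_last_group() returns 523776. If the release fence were removed, the last group could see the counter reach its final value and still read a partial sum that has not been written yet, which is the failure described above.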

I hope this makes some sense. It is confusing.
