Memory order consume usage in C11

Views: 78 | Published: 2020/5/13 21:12:27 | Tags: c, multithreading, c11, stdatomic

Question

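I read about the carries-a-dependency relation and the dependency-ordered-before relation, which uses it in its definition in 5.1.2.4(p16):

An evaluation A is dependency-ordered before an evaluation B if:

- A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or
- for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

So I tried to make an example where it could be useful. Here it is: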

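#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int i;

void *produce(void *ptr){
    int int_value = *((int *) ptr);
    atomic_store_explicit(&i, int_value, memory_order_release);
    return NULL;
}

void *consume(void *ignored){
    (void) ignored;
    int int_value = atomic_load_explicit(&i, memory_order_consume);
    // new_int_value is computed from the consumed value, so it carries a dependency.
    int new_int_value = int_value + 42;
    printf("Consumed = %d\n", new_int_value);
    return NULL; // the original fell off the end of a non-void function
}

int main(void){
    int int_value = 123123;
    pthread_t t2;
    pthread_create(&t2, NULL, &produce, &int_value);

    pthread_t t1;
    pthread_create(&t1, NULL, &consume, NULL);

    // Join both threads instead of sleep(1000), so main returns once they finish.
    pthread_join(t2, NULL);
    pthread_join(t1, NULL);
    return 0;
}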

In the function void *consume(void*), int_value carries a dependency to new_int_value, so if atomic_load_explicit(&i, memory_order_consume); reads a value written by some atomic_store_explicit(&i, int_value, memory_order_release);, then that atomic_store_explicit(&i, int_value, memory_order_release); is dependency-ordered before the computation of new_int_value.

But what useful things can the dependency-ordered-before give us?

I currently think that the memory_order_consume may well be replaced with memory_order_acquire without causing any data race...

Answer

consume is cheaper than acquire. Every CPU except DEC Alpha AXP, with its famously weak memory model¹, does consume for free, unlike acquire. (The exceptions are x86 and SPARC-TSO, where acquire is free too, because the hardware provides acq/rel memory ordering without extra barriers or special instructions.)

On ARM/AArch64/PowerPC/MIPS/etc weakly-ordered ISAs, consume and relaxed are the only orderings that don't require any extra barriers, just ordinary cheap load instructions. i.e. all asm load instructions are (at least) consume loads, except on Alpha. acquire requires LoadStore and LoadLoad ordering, which is a cheaper barrier instruction than a full-barrier for seq_cst, but still more expensive than nothing.

mo_consume is like acquire only for loads with a data dependency on the consume load. e.g. float *array = atomic_ld(&shared, mo_consume);, then access to any array[i] is safe if the producer stored the buffer and then used a mo_release store to write the pointer to the shared variable. But independent loads/stores don't have to wait for the consume load to complete, and can happen before it even if they appear later in program order. So consume only orders the bare minimum, not affecting other loads or stores.
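The pointer-publication pattern described above can be sketched in C11 roughly as follows. This is a minimal illustration, not code from the original answer: the names `buffer`, `shared`, `producer`, `consumer`, `run_publish_demo` and the size `N` are all hypothetical, and POSIX threads are assumed. Note also that current compilers are expected to treat the `memory_order_consume` load as `memory_order_acquire`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N 4                                /* hypothetical buffer size */

static float buffer[N];                    /* payload, written before publication */
static _Atomic(float *) shared = NULL;     /* pointer published by the producer */

static void *producer(void *arg) {
    (void)arg;
    for (int k = 0; k < N; k++)
        buffer[k] = (float)k * 1.5f;       /* plain stores to the payload */
    /* The release store keeps the payload stores ordered before the pointer store. */
    atomic_store_explicit(&shared, buffer, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    float *sum = arg;
    float *array;
    /* Spin until the producer has published the pointer. */
    while ((array = atomic_load_explicit(&shared, memory_order_consume)) == NULL)
        ;
    *sum = 0.0f;
    for (int k = 0; k < N; k++)
        *sum += array[k];                  /* address depends on the consumed pointer */
    return NULL;
}

/* Runs one producer and one consumer; returns the sum the consumer computed. */
float run_publish_demo(void) {
    pthread_t p, c;
    float sum = -1.0f;
    atomic_store_explicit(&shared, NULL, memory_order_relaxed);
    pthread_create(&c, NULL, consumer, &sum);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return sum;
}
```

Only the `array[k]` accesses are ordered after the consume load here. A load from some unrelated global inside `consumer` would get no ordering guarantee from it, which is exactly the difference from acquire.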

(It's basically free to implement support for consume semantics in hardware for most CPU designs, because OoO exec can't break true dependencies, and a load has a data dependency on the pointer, so loading a pointer and then dereferencing it inherently orders those 2 loads just by the nature of causality. Unless CPUs do value-prediction or something crazy. Value prediction is like branch prediction, but it guesses what value is going to be loaded instead of which way a branch is going to go.

Alpha had to do some crazy stuff to make CPUs that could actually load data from before the pointer value was truly loaded, when the stores were done in order with sufficient barriers.

Unlike for stores, where the store buffer can introduce reordering between store execution and commit to L1d cache, loads become "visible" by taking data from L1d cache when they execute, not when they retire + eventually commit. So ordering 2 loads wrt. each other really does just mean executing those 2 loads in order. With a data dependency of one on the other, causality requires that on CPUs without value prediction, and on most architectures the ISA rules do specifically require that. So you don't have to use a barrier between loading + using a pointer in asm, e.g. for traversing a linked list.)

See also Dependent loads reordering in CPU.
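The linked-list traversal mentioned above might look like this in C11. It is only a sketch under stated assumptions: `struct node`, `head`, `list_push` and `list_sum` are hypothetical names, there is a single producer, and nodes are never modified or freed after being published.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;     /* written once, before the node is published */
};

static _Atomic(struct node *) head = NULL;

/* Single producer assumed: fill in the node, then publish it with a release store. */
void list_push(struct node *n, int value) {
    n->value = value;
    n->next = atomic_load_explicit(&head, memory_order_relaxed);
    atomic_store_explicit(&head, n, memory_order_release);
}

/* Reader: every later load uses an address derived from the consumed pointer. */
int list_sum(void) {
    int sum = 0;
    struct node *p = atomic_load_explicit(&head, memory_order_consume);
    while (p != NULL) {
        sum += p->value;   /* depends on p */
        p = p->next;       /* depends on p; the chain continues from here */
    }
    return sum;
}
```

Each `p->value` and `p->next` load takes its address from the previously loaded pointer, so on the weakly-ordered ISAs discussed here no barrier instruction is needed between the loads (Alpha excepted).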

But current compilers just give up and strengthen `consume` to `acquire`

... instead of trying to map C dependencies to asm data dependencies (without accidentally breaking the data dependency and leaving only a control dependency that branch prediction + speculative execution could bypass). Apparently it's a hard problem for compilers to keep track of it and make it safe.

It's non-trivial to map C to asm, because if the dependency is only in the form of a conditional branch, the asm rules don't apply. So it's hard to define C rules for mo_consume propagating dependencies only in ways that line up with what does "carry a dependency" in terms of asm ISA rules.
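To make the data-dependency versus control-dependency distinction concrete, here is a small C11 sketch. The names `table`, `index_slot`, `publish_index` and the two reader functions are hypothetical, and the comments describe what the asm-level ISA rules would and would not order, assuming a compiler that translated each function literally.

```c
#include <assert.h>
#include <stdatomic.h>

static int table[2] = {10, 20};
static _Atomic int index_slot = 0;     /* hypothetical shared index, 0 or 1 */

void publish_index(int idx) {
    atomic_store_explicit(&index_slot, idx, memory_order_release);
}

/* Data dependency: the address of the second load is computed from the
   consumed value, so the hardware has to perform the loads in order. */
int read_via_data_dependency(void) {
    int idx = atomic_load_explicit(&index_slot, memory_order_consume);
    return table[idx];
}

/* Control dependency only: each load address is a constant, so branch
   prediction could let the CPU load table[1] before idx is known. */
int read_via_control_dependency(void) {
    int idx = atomic_load_explicit(&index_slot, memory_order_consume);
    if (idx == 1)
        return table[1];
    return table[0];
}
```

Both functions return the same values, which is the problem: an optimizer that is not tracking dependencies could in principle turn the first form into something like the second, and the hardware guarantee would be lost without any visible change in single-threaded behaviour.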

So yes, you're correct that consume can be safely replaced with acquire, but you're totally missing the point.

ISAs with weak memory-ordering rules do have rules about which instructions carry a dependency. So even an instruction like ARM eor r0,r0 which unconditionally zeroes r0 is architecturally required to still carry a data dependency on the old value, unlike x86 where the xor eax,eax idiom is specially recognized as dependency-breaking².

See also http://preshing.com/20140709/cpp11/ on the purpose of memory_order_consume.

I also mentioned mo_consume in an answer on Atomic operations, std::atomic<> and ordering of writes.

Footnote 1: The few Alpha models that actually could in theory "violate causality" didn't do value-prediction; there was a different mechanism involving their banked cache. I think I've seen a more detailed explanation of how it was possible, but Linus's comments about how rare it actually was are interesting.

Linus Torvalds (Linux lead developer), in a RealWorldTech forum thread

I wonder, did you see non-causality on Alpha by yourself or just in the manual?

I never saw it myself, and I don't think any of the models I ever had access to actually did it. Which actually made the (slow) RMB instruction extra annoying, because it was just pure downside.

Even on CPUs that actually could re-order the loads, it was apparently basically impossible to hit in practice. Which is actually pretty nasty. It results in "oops, I forgot a barrier, but everything worked fine for a decade, with three odd reports of 'that can't happen' bugs from the field" kinds of things. Figuring out what's going on is just painful as hell.

Which models actually had it? And how exactly they got here?

I think it was the 21264, and I have this dim memory of it being due to a partitioned cache: even if the originating CPU did two writes in order (with a wmb in between), the reading CPU might end up having the first write delayed (because the cache partition that it went into was busy with other updates), and would read the second write first. If that second write was the address to the first one, it could then follow that pointer, and without a read barrier to synchronize the cache partitions, it could see the old stale value.

But note the "dim memory". I may have confused it with something else. I haven't actually used an alpha in closer to two decades by now. You can get very similar effects from value prediction, but I don't think any alpha microarchitecture ever did that.

Anyway, there definitely were versions of the alpha that could do this, and it wasn't just purely theoretical.

(RMB = Read Memory Barrier asm instruction, and/or the name of Linux kernel function rmb() that wraps whatever inline asm is necessary to make that happen. e.g. on x86, just a barrier to compile-time reordering, asm("":::"memory"). I think modern Linux manages to avoid an acquire barrier when only a data dependency is needed, unlike C11/C++11, but I forget. Linux is only portable to a few compilers, and those compilers do take care to support what Linux depends on, so they have an easier time than the ISO C11 standard in cooking up something that works in practice on real ISAs.)

See also https://lkml.org/lkml/2012/2/1/521 re: Linux's smp_read_barrier_depends() which is necessary in Linux only because of Alpha. (But a reply from Hans Boehm points out that "compilers can, and sometimes do, remove dependencies", which is why C11 memory_order_consume support needs to be so elaborate to avoid risk of breakage. Thus smp_read_barrier_depends is potentially brittle.)

Footnote 2: x86 orders all loads whether they carry a data dependency on the pointer or not, so it doesn't need to preserve "false" dependencies, and with a variable-length instruction set it actually saves code size to use xor eax,eax (2 bytes) instead of mov eax,0 (5 bytes).

So xor reg,reg became the standard idiom since early 8086 days, and now it's recognized and actually handled like mov, with no dependency on the old value of RAX. (And in fact more efficiently than mov reg,0 beyond just code-size: What is the best way to set a register to zero in x86 assembly: xor, mov or and?)

But this is impossible for ARM or most other weakly ordered ISAs; like I said, they're literally not allowed to do this.

ldr r3, [something]       ; load r3 = mem
eor r0, r3,r3             ; r0 = r3^r3 = 0
ldr r4, [r1, r0]          ; load r4 = mem[r1+r0].  Ordered after the other load

is required to inject a dependency on r0 and order the load of r4 after the load of r3, even though the load address r1+r0 is always just r1 because r3^r3 = 0. But only that load, not all other later loads; it's not an acquire barrier or an acquire load.
