An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties


Question

A follow-up question to "Why does this `std::atomic_thread_fence` work".

Since a dummy interlocked operation is better than `_mm_mfence`, and there are quite a few ways to implement one, which interlocked operation should be used, and on what data?

Assume inline assembly that is not aware of the surrounding context, but that can tell the compiler which registers it clobbers.

Recommended answer

Short answer for now, without going into too much detail about why. See specifically the discussion in the comments on that linked question.

`lock orb $0, -1(%rsp)` is probably a good bet to avoid lengthening dependency chains for local variables that get spilled and reloaded. See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for benchmarks. On Windows x64 (no red zone), that space should be unused except by future `call` or `push` instructions.
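The `lock orb` suggestion above can be sketched as a small wrapper. This is only an illustration under stated assumptions: it relies on GNU-style extended inline assembly in AT&T syntax (GCC or Clang targeting x86-64), and the names `seq_cst_fence_x86` and `store_fence_load` are hypothetical, not from the answer. On targets with a red zone the byte at `-1(%rsp)` may hold live data, but OR-ing in 0 writes back the same value.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper name, not from the original answer.
inline void seq_cst_fence_x86() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // OR-ing 0 into the byte just below the stack pointer leaves memory
    // unchanged; the lock prefix is what makes this a full barrier.
    // "memory" keeps the compiler from moving memory accesses across it,
    // and "cc" is listed because OR writes the flags register.
    __asm__ __volatile__("lock orb $0, -1(%%rsp)" ::: "memory", "cc");
#else
    // Assumption: on any other target, fall back to the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Single-threaded usage sketch: a relaxed store, the fence, a relaxed load.
inline int store_fence_load() {
    std::atomic<int> flag{0};
    flag.store(1, std::memory_order_relaxed);
    seq_cst_fence_x86();
    int seen = flag.load(std::memory_order_relaxed);
    assert(seen == 1);  // the fence must not disturb program state
    return seen;
}
```

The wrapper takes no operands and names no general-purpose registers, so the only things the compiler needs to be told about are the flags and memory clobbers.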

Store forwarding to the load side of a locked operation might be a thing (if that space was recently used), so keeping the locked operation narrow is good. But because it is a full barrier, I don't expect any store forwarding from its output to anything else, so unlike the normal case, a narrow (1-byte) `lock orb` doesn't have that downside.

`mfence` is pretty crap compared to a locked operation on a hot line of stack space, even on Haswell, and is probably worse on Skylake, where it even blocks out-of-order execution. (It is also bad on AMD compared to `lock add`.)
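To check the claim about `mfence` on a particular machine, a rough timing loop along these lines could be used. It is a sketch, not a rigorous benchmark: the function names, the iteration count, and the `spill` variable standing in for a spilled local are all assumptions, and it again assumes GNU-style inline assembly on x86-64. Results will vary by microarchitecture, so no expected numbers are given.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
inline void fence_mfence()  { __asm__ __volatile__("mfence" ::: "memory"); }
inline void fence_lock_or() { __asm__ __volatile__("lock orb $0, -1(%%rsp)" ::: "memory", "cc"); }
#else
// Assumption: on other targets both variants fall back to the standard fence,
// so the comparison is only meaningful on x86-64.
inline void fence_mfence()  { std::atomic_thread_fence(std::memory_order_seq_cst); }
inline void fence_lock_or() { std::atomic_thread_fence(std::memory_order_seq_cst); }
#endif

// Returns the average wall-clock nanoseconds per loop iteration.
template <typename Fence>
double ns_per_fence(Fence fence, std::uint64_t iterations) {
    assert(iterations > 0);
    volatile int spill = 0;  // stands in for a local that lives on the stack
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        spill = spill + 1;   // reload and store around each fence
        fence();
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `ns_per_fence(fence_mfence, 1000000)` and `ns_per_fence(fence_lock_or, 1000000)` and comparing the two values gives a first impression; the benchmarks linked in the answer are a better guide.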
