When are x86 LFENCE, SFENCE and MFENCE instructions required?


Question


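OK, I have been reading the following questions on SO about the x86 CPU fences (LFENCE, SFENCE and MFENCE):

  • Does it make any sense instruction LFENCE in processors x86/x86_64? http://stackoverflow.com/questions/20316124/does-it-make-any-sense-instruction-lfence-in-processors-x86-x86-64
  • What is the impact SFENCE and LFENCE to caches of neighboring cores? http://stackoverflow.com/questions/20326280/what-is-the-impact-sfence-and-lfence-to-caches-of-neighboring-cores/20329574#20329574
  • Is the MESI protocol enough, or are memory barriers still required? (Intel CPUs) http://stackoverflow.com/questions/27522190/is-the-mesi-protocol-enough-or-are-memory-barriers-still-required-intel-cpus

and:

  • http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
  • https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c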

and, to be honest, I am still not totally sure when a fence is required. I am trying to understand this from the perspective of removing full-blown locks and using finer-grained synchronization via fences, to minimise latency.

Firstly, here are two specific things I do not understand:

Sometimes when doing a store, a CPU will write to its store buffer instead of the L1 cache. However, I do not understand under what conditions a CPU will do this.

CPU2 may wish to load a value which has been written into CPU1's store buffer. As I understand it, the problem is that CPU2 cannot see the new value in CPU1's store buffer. Why can't the MESI protocol just include flushing store buffers as part of its protocol?

More generally, could somebody please attempt to describe the overall scenario and help explain when LFENCE/MFENCE and SFENCE instructions are required?

NB One of the problems reading around this subject is the number of articles written "generally" for multiple CPU architectures, when I am only interested in the Intel x86-64 architecture specifically.

Answer

The simplest answer: you must use one of the 3 fences (LFENCE, SFENCE, MFENCE) to provide one of the 6 kinds of memory consistency:

  • Relaxed
  • Consume
  • Acquire
  • Release
  • Acquire-Release
  • Sequential

C++11:

First, you should consider this problem from the point of view of memory-access ordering, which is well documented and standardized in C++11. You should read this first: http://en.cppreference.com/w/cpp/atomic/memory_order

x86/x86_64:

1. Acquire-Release Consistency: It is important to understand that on x86, access to conventional RAM (marked by default as WB - Write Back; the same holds for WT (Write Through) and UC (Uncacheable)) with a plain asm MOV, without any additional instructions, automatically provides the memory ordering needed for Acquire-Release Consistency: std::memory_order_acquire for loads and std::memory_order_release for stores (std::memory_order_acq_rel for read-modify-write operations).

I.e. for this kind of memory it makes sense to use std::memory_order_seq_cst only when you need Sequential Consistency. Whether you use std::memory_order_relaxed or acquire/release ordering, the compiled assembler code for std::atomic::store() (or std::atomic::load()) will be the same: only MOV, without any L/S/MFENCE.

Note: Not only the CPU but also the C++ compiler can reorder memory operations, and all 6 memory orderings always constrain the C++ compiler regardless of the CPU architecture.

Then you need to know how this is compiled from C++ to asm (native machine code), or how to write it in assembler yourself. To provide any consistency except Sequential, you can simply write MOV, for example MOV reg, [addr] and MOV [addr], reg.
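As a minimal sketch of the acquire-release case described above, the following C++11 fragment publishes an ordinary variable through an atomic flag. The names `payload`, `ready` and `message_passing_once` are made up for this illustration. On x86 with WB memory, compilers typically emit a plain MOV for both the release store and the acquire load, but that is a property of the code generator, so check the disassembly of your own build rather than relying on it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, used only for this illustration.
static int payload = 0;                 // ordinary (non-atomic) data
static std::atomic<bool> ready{false};  // flag that publishes the data

// Runs one producer/consumer exchange and returns the value the consumer saw.
int message_passing_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;  // plain store
        // Release store: typically a plain MOV on x86, no fence instruction.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&seen] {
        // Acquire load: typically a plain MOV on x86 as well.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load, so this reads 42
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer always assigns before it exits
    return seen;
}
```

Even though no fence instruction is needed here, the memory orders still matter: they stop the compiler from moving the `payload` accesses across the flag operations.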

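2. Sequential Consistency: To provide Sequential Consistency you must use implicit (LOCK) or explicit (L/S/MFENCE) fences, as described here: Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency? http://stackoverflow.com/questions/19047327/why-gcc-does-not-use-loadwithout-fence-and-storesfence-for-stdmemory-order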

  1. LOAD (without fence) and STORE + MFENCE
  2. LOAD (without fence) and LOCK XCHG
  3. MFENCE + LOAD and STORE (without fence)
  4. LOCK XADD ( 0 ) and STORE (without fence)

For example, at the time of writing GCC uses 1 and MSVC uses 2. (Be aware that MSVS2012 has a bug: Does the semantics of `std::memory_order_acquire` requires processor instructions on x86/x86_64? http://stackoverflow.com/questions/18576986/does-the-semantics-of-stdmemory-order-acquire-requires-processor-instruction)

Then you can read the material by Herb Sutter that you linked: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
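The classic "store buffer" litmus test shows why the sequentially consistent mappings listed above need a fence or a LOCKed instruction. The sketch below uses hypothetical names (`x`, `y`, `count_both_zero`): each thread stores to one variable and then loads the other. With `std::memory_order_seq_cst` the C++ memory model forbids both loads returning 0. With weaker orderings that outcome is allowed, and on x86 it can really happen, because each store may still sit in its core's store buffer when the other core's load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, used only for this illustration.
static std::atomic<int> x{0};
static std::atomic<int> y{0};

// Runs the litmus test `rounds` times and returns how many rounds ended with
// both threads reading 0. With seq_cst operations this count must be 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        x.store(0, std::memory_order_seq_cst);
        y.store(0, std::memory_order_seq_cst);
        int r1 = -1;
        int r2 = -1;

        std::thread t1([&r1] {
            x.store(1, std::memory_order_seq_cst);  // e.g. MOV + MFENCE, or XCHG
            r1 = y.load(std::memory_order_seq_cst); // plain MOV
        });
        std::thread t2([&r2] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Which instruction sequence the compiler picks for the seq_cst store (a MOV followed by MFENCE, or an XCHG) depends on the compiler and its version.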

Exceptions to the rule:

This rule holds for access with MOV to conventional RAM marked by default as WB - Write Back. The memory type is set in the page table, in each PTE (page table entry), for each page (4 KB of contiguous memory).

But there are some exceptions:

  1. If memory is marked in the page table as Write Combining (ioremap_wc() in the Linux kernel), then only Acquire consistency is provided automatically, and we must act as described in the paragraph below.

  2. See answer to my question: http://stackoverflow.com/a/27302931/1558037

  • Writes to memory are not reordered with other writes, with the following exceptions:
    • writes executed with the CLFLUSH instruction;
    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
    • string operations (see Section 8.2.4.1).

In both cases 1 and 2 you must put an additional SFENCE between two writes whose order matters, even if you only want Acquire-Release Consistency, because here only Acquire consistency is provided automatically and you must do the Release (SFENCE) yourself.

Answers to your two questions:
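For the non-temporal store exception, a sketch of the pattern could look like the following. The names (`buffer`, `published`, `run_streaming_example`) are invented for the example. It assumes an x86-64 compiler that provides the SSE2 intrinsics `_mm_stream_si32` (MOVNTI) and `_mm_sfence`. On other targets the sketch falls back to ordinary stores, which need no extra fence before the release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI)
#include <xmmintrin.h>  // _mm_sfence (SFENCE)
#define EXAMPLE_HAVE_X86_NT 1
#endif

// Hypothetical names, used only for this illustration.
static int buffer[4];
static std::atomic<bool> published{false};

void publish_with_streaming_stores() {
#ifdef EXAMPLE_HAVE_X86_NT
    for (int i = 0; i < 4; ++i) {
        _mm_stream_si32(&buffer[i], i + 1);  // weakly ordered non-temporal store
    }
    // Without this SFENCE the flag store below could become visible to another
    // core before the streaming stores do.
    _mm_sfence();
#else
    for (int i = 0; i < 4; ++i) {
        buffer[i] = i + 1;  // ordinary stores on non-x86 targets
    }
#endif
    published.store(true, std::memory_order_release);
}

// Publishes the buffer from one thread, reads it from another, returns the sum.
int run_streaming_example() {
    published.store(false, std::memory_order_relaxed);
    int sum = 0;

    std::thread producer(publish_with_streaming_stores);
    std::thread consumer([&sum] {
        while (!published.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (int i = 0; i < 4; ++i) {
            sum += buffer[i];
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The SFENCE goes after the last streaming store and before the store that publishes the data. That is the "Release" step the answer says you must do yourself.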

Sometimes when doing a store a CPU will write to its store buffer instead of the L1 cache. I do not however understand the terms on which a CPU will do this?

From the user's point of view, the L1 cache and the store buffer behave differently. L1 is fast, but the store buffer is faster.

  • Store buffer - a simple queue that holds only writes, and in which writes are not reordered. It exists to increase performance by hiding the latency of access to the cache (roughly L1 - 1 ns, L2 - 3 ns, L3 - 10 ns). The CPU core behaves as if the write has already reached the cache and executes the next instruction, while in fact the write has only been saved to the store buffer and will be written to the L1/2/3 cache later. I.e. the CPU core does not need to wait until its writes have been stored to the cache.

  • Cache L1/2/3 - looks like a transparent associative array (address - value). It is fast but not the fastest, because x86 automatically provides Acquire-Release Consistency by using a cache-coherence protocol, MESIF/MOESI. This makes multithreaded programming simpler but costs performance. (Strictly speaking, we can use write-contention-free algorithms and data structures without cache coherence, i.e. without MESIF/MOESI, for example over PCI Express.) The MESIF/MOESI protocols work over the interconnect (such as QPI) that links the cores within a CPU and the cores of different CPUs in multiprocessor systems (ccNUMA).
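The difference between the two structures can be made concrete with a deliberately simplified, single-threaded toy model. This is only an illustration under assumed names (`SharedCache`, `ToyCore`, `stale_then_fresh`) and is not how real hardware is implemented. Each core queues its writes in a private FIFO, its own loads can be served from that FIFO, and other cores see a write only after it has drained into the shared, coherent cache.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Toy model only; all names are hypothetical.
using Address = std::uint64_t;
using SharedCache = std::unordered_map<Address, int>;  // stands in for coherent L1/2/3

struct ToyCore {
    SharedCache& cache;                                 // shared by all cores
    std::deque<std::pair<Address, int>> store_buffer;   // private FIFO of pending writes

    explicit ToyCore(SharedCache& c) : cache(c) {}

    // A store goes to the private buffer first; the core does not wait for the cache.
    void store(Address a, int v) { store_buffer.emplace_back(a, v); }

    // A load checks the core's own pending writes (newest first), then the cache.
    int load(Address a) const {
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it) {
            if (it->first == a) return it->second;
        }
        auto hit = cache.find(a);
        return hit == cache.end() ? 0 : hit->second;
    }

    // Writes pending stores to the cache in FIFO order, as a full fence would force.
    void drain() {
        while (!store_buffer.empty()) {
            cache[store_buffer.front().first] = store_buffer.front().second;
            store_buffer.pop_front();
        }
    }
};

// Returns what a second core reads before and after the first core drains.
std::pair<int, int> stale_then_fresh() {
    SharedCache cache;
    ToyCore cpu1(cache);
    ToyCore cpu2(cache);

    cpu1.store(0x10, 7);
    int before = cpu2.load(0x10);   // cpu2 cannot see cpu1's store buffer
    assert(cpu1.load(0x10) == 7);   // cpu1 sees its own pending write
    cpu1.drain();
    int after = cpu2.load(0x10);    // now visible through the shared cache
    return {before, after};
}
```

In this model the coherence layer (`SharedCache`) never looks into a core's private queue, which is the situation the question's second point describes.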

CPU2 may wish to load a value which has been written in to CPU1's store buffer. As I understand it, the problem is CPU2 cannot see the new value in CPU1's store buffer.

Yes.

Why can't the MESI protocol just include flushing store buffers as part of its protocol??

The MESI protocol can't just include flushing store buffers as part of its protocol, because:

  • The MESI/MOESI/MESIF protocols are not related to the store buffer and know nothing about it.
  • Automatically flushing the store buffer on every write would decrease performance and would make the store buffer useless.
  • Manually flushing the store buffers on all remote CPU cores with some instruction (we don't know which core's store buffer contains the required write) would also decrease performance (with 8 CPUs x 15 cores = 120 cores flushing their store buffers at the same time, this would be terrible).

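But manually flushing the store buffer on the current CPU core is possible. A full barrier (MFENCE, or any LOCK-prefixed instruction) makes the core wait until its pending stores are globally visible before later loads execute. SFENCE orders only stores with respect to other stores. The fences are used in two cases:

  • To provide Sequential Consistency on write-back cacheable RAM: MFENCE (or LOCK), as in the list of mappings above
  • To provide Acquire-Release Consistency in the exceptions to the rule (write-combining RAM, writes executed with the CLFLUSH instruction, and non-temporal SSE/AVX stores): SFENCE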
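In C++11 an explicit full fence can be requested with `std::atomic_thread_fence(std::memory_order_seq_cst)`. On x86, compilers commonly implement it with MFENCE or a LOCKed dummy instruction, but the choice is up to the compiler. The sketch below, with hypothetical names `a`, `b` and `count_both_zero_with_fences`, puts such a fence between a relaxed store and a relaxed load in each thread. That placement rules out the outcome where both threads read 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, used only for this illustration.
static std::atomic<int> a{0};
static std::atomic<int> b{0};

// Returns the number of rounds in which both threads read 0. With a seq_cst
// fence between each store and the following load, this must be 0.
int count_both_zero_with_fences(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        a.store(0, std::memory_order_relaxed);
        b.store(0, std::memory_order_relaxed);
        int r1 = -1;
        int r2 = -1;

        std::thread t1([&r1] {
            a.store(1, std::memory_order_relaxed);                // plain MOV
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
            r1 = b.load(std::memory_order_relaxed);               // plain MOV
        });
        std::thread t2([&r2] {
            b.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = a.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Removing the two fences makes the both-zero outcome legal again, although a short run may not happen to observe it.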

Note:

Do we ever need LFENCE on x86/x86_64? The answer to that is not always clear: Does it make any sense instruction LFENCE in processors x86/x86_64? http://stackoverflow.com/questions/20316124/does-it-make-any-sense-instruction-lfence-in-processors-x86-x86-64

Other platforms:

Then you can read how this works in theory (for a spherical processor in a vacuum) with a store buffer and an invalidate queue, in the paper you linked: http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

And here is how Sequential Consistency can be provided on other platforms, not only with L/S/MFENCE and LOCK but also with LL/SC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
