Java 8 Unsafe: xxxFence() instructions


Question



    In Java 8 three memory barrier instructions were added to Unsafe class (source):

    /**
     * Ensures lack of reordering of loads before the fence
     * with loads or stores after the fence.
     */
    void loadFence();
    
    /**
     * Ensures lack of reordering of stores before the fence
     * with loads or stores after the fence.
     */
    void storeFence();
    
    /**
     * Ensures lack of reordering of loads or stores before the fence
     * with loads or stores after the fence.
     */
    void fullFence();
    
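    For context, here is a minimal sketch of how these methods can actually be invoked. sun.misc.Unsafe is not part of the supported API and has no public accessor, so the usual (unsupported) idiom on HotSpot JDKs is to grab the singleton via reflection; the class name below is mine:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeFenceDemo {
    public static void main(String[] args) throws Exception {
        // Unsafe has no public constructor or accessor;
        // grab the "theUnsafe" singleton field via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        unsafe.loadFence();   // loads before this point not reordered with later loads/stores
        unsafe.storeFence();  // stores before this point not reordered with later loads/stores
        unsafe.fullFence();   // no loads or stores reordered across this point
        System.out.println("all fences executed");
    }
}
```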

    If we define a memory barrier in the following way (which I consider more or less easy to understand):

    Consider X and Y to be operation types/classes that are subject for reordering,

    X_YFence() is a memory barrier instruction that ensures that all operations of type X before the barrier have completed before any operation of type Y after the barrier starts.

    We can now "map" barrier names from Unsafe to this terminology:

    • loadFence() becomes load_loadstoreFence();
    • storeFence() becomes store_loadstoreFence();
    • fullFence() becomes loadstore_loadstoreFence();

    Finally, my question is - why don't we have load_storeFence(), store_loadFence(), store_storeFence() and load_loadFence()?

    My guess would be - they are not really necessary, but I do not understand why at the moment. So, I'd like to know the reasons for not adding them. Guesses about that are welcome too (hope this doesn't cause this question to be off-topic as opinion-based, though).

    Thanks in advance.

    Solution

    Summary

    CPU cores have special memory ordering buffers to assist them with out-of-order execution. These can be (and typically are) separate for loading and storing: LOBs for load-order buffers and SOBs for store-order buffers.

    The fencing operations chosen for the Unsafe API were selected based on the following assumption: underlying processors will have separate load-order buffers (for reordering loads) and store-order buffers (for reordering stores).

    Therefore, based on this assumption, from a software point of view, you can request one of three things from the CPU:

    1. Empty the LOBs (loadFence): means that no other instructions will start executing on this core until ALL entries in the LOBs have been processed. In x86 this is an LFENCE.
    2. Empty the SOBs (storeFence): means that no other instructions will start executing on this core until ALL entries in the SOBs have been processed. In x86 this is an SFENCE.
    3. Empty both LOBs and SOBs (fullFence): means both of the above. In x86 this is an MFENCE.

    In reality, each specific processor architecture provides different memory ordering guarantees, which may be more stringent or more flexible than the above. For example, the SPARC architecture can reorder load-store and store-load sequences, whereas x86 will not. Furthermore, architectures exist where LOBs and SOBs cannot be controlled individually (i.e. only a full fence is possible). In both cases, however:

    • when the architecture is more flexible, the API simply does not provide access to the "laxer" sequencing combinations, as a matter of choice

    • when the architecture is more stringent, the API simply implements the more stringent sequencing guarantee in all cases (e.g. all 3 calls actually end up being implemented as a full fence)

    The reason for the particular API choices is explained in the JEP, as per the answer assylias provides, which is 100% on-the-spot. If you know about memory ordering and cache coherence, assylias' answer should suffice. I think the fact that they match the standardized instructions in the C++ API was a major factor (it simplifies JVM implementation a lot): http://en.cppreference.com/w/cpp/atomic/memory_order. In all likelihood, the actual implementation will call into the respective C++ API instead of using some special instruction.
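    As an aside beyond the original answer: since Java 9 the same three fences are also exposed through the public VarHandle API, which follows the C++ memory_order naming more closely. A minimal sketch, assuming a Java 9+ runtime (the class name is mine):

```java
import java.lang.invoke.VarHandle;

public class VarHandleFences {
    public static void main(String[] args) {
        // Public (Java 9+) counterparts of the Unsafe fences discussed here:
        VarHandle.acquireFence(); // ~ Unsafe.loadFence():  load | load+store
        VarHandle.releaseFence(); // ~ Unsafe.storeFence(): store | load+store
        VarHandle.fullFence();    // ~ Unsafe.fullFence():  load+store | load+store
        System.out.println("fences executed");
    }
}
```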

    Below I have a detailed explanation with x86-based examples, which will provide all the context necessary to understand these things. In fact, the demarcated section below answers another question: "Can you provide basic examples of how memory fences work to control cache coherence in the x86 architecture?"

    The reason for this is that I myself (coming from a software-development rather than a hardware-design background) had trouble understanding what memory reordering is until I learned specific examples of how cache coherence actually works in x86. This provides invaluable context for discussing memory fences in general (for other architectures as well). At the end I discuss SPARC a bit, using the knowledge gained from the x86 examples.

    The reference [1] is an even more detailed explanation and has a separate section for discussing each of: x86, SPARC, ARM and PowerPC, so it is an excellent read if you are interested in more details.


    x86 architecture example

    x86 provides 3 types of fencing instructions: LFENCE (load fence), SFENCE (store fence) and MFENCE (full memory fence), so it maps 100% to the Java API.

    This is because x86 has separate load-order buffers (LOBs) and store-order buffers (SOBs), so indeed LFENCE/SFENCE instructions apply to the respective buffer, whereas MFENCE applies to both.

    SOBs are used to store an outgoing value (from processor to cache system) while the cache coherence protocol works to acquire permission to write to the cache line. LOBs are used to store invalidation requests so that invalidation can execute asynchronously (reduces stalling on the receiving side in the hope that the code executing there will not actually need that value).

    Out-of-order stores and SFENCE

    Suppose you have a dual processor system with its two CPUs, 0 and 1, executing the routines below. Consider the case where the cache line holding failure is initially owned by CPU 1, whereas the cache line holding shutdown is initially owned by CPU 0.

    // CPU 0:
    void shutDownWithFailure(void)
    {
      failure = 1; // must use SOB as this is owned by CPU 1
      shutdown = 1; // can execute immediately as it is owned by CPU 0
    }
    // CPU1:
    void workLoop(void)
    {
      while (shutdown == 0) { ... }
      if (failure) { ...}
    }
    

    In the absence of a store fence, CPU 0 may signal a shutdown due to failure, but CPU 1 will exit the loop and NOT go into the failure-handling if block.

    This is because CPU0 will write the value 1 for failure to a store-order buffer, also sending out a cache coherence message to acquire exclusive access to the cache line. It will then proceed to the next instruction (while waiting for exclusive access) and update the shutdown flag immediately (this cache line is owned exclusively by CPU0 already so no need to negotiate with other cores). Finally, when it later receives an invalidation confirmation message from CPU1 (regarding failure) it will proceed to process the SOB for failure and write the value to the cache (but the order is by now reversed).

    Inserting a storeFence() will fix things:

    // CPU 0:
    void shutDownWithFailure(void)
    {
      failure = 1; // must use SOB as this is owned by CPU 1
      SFENCE // next instruction will execute after all SOBs are processed
      shutdown = 1; // can execute immediately as it is owned by CPU 0
    }
    // CPU1:
    void workLoop(void)
    {
      while (shutdown == 0) { ... }
      if (failure) { ...}
    }
    

    A final aspect that deserves mention is that x86 has store-forwarding: when a CPU writes a value which gets stuck in an SOB (due to cache coherence), it may subsequently attempt to execute a load instruction for the same address BEFORE the SOB is processed and delivered to the cache. CPUs will therefore consult the SOBs PRIOR to accessing the cache, so the value retrieved in this case is the last-written value from the SOB. This means that stores from THIS core can never be reordered with subsequent loads from THIS core, no matter what.

    Out-of-order loads and LFENCE

    Now, assume you have the store fence in place and are happy that shutdown cannot overtake failure on its way to CPU 1, and focus on the other side. Even in the presence of the store fence, there are scenarios where the wrong thing happens. Consider the case where failure is in both caches (shared) whereas shutdown is only present in and owned exclusively by the cache of CPU0. Bad things can happen as follows:

    1. CPU0 writes 1 to failure; It also sends a message to CPU1 to invalidate its copy of the shared cache line as part of the cache coherence protocol.
    2. CPU0 executes the SFENCE and stalls, waiting for the SOB used for failure to commit.
    3. CPU1 checks shutdown due to the while loop and (realizing it is missing the value) sends a cache coherence message to read the value.
    4. CPU1 receives the message from CPU0 in step 1 to invalidate failure, sending an immediate acknowledgement for it. NOTE: this is implemented using the invalidation queue, so in fact it simply enters a note (allocates an entry in its LOB) to later do the invalidation, but does not actually perform it before sending out the acknowledgement.
    5. CPU0 receives the acknowledgement for failure and proceeds past the SFENCE to the next instruction
    6. CPU0 writes 1 to shutdown without using a SOB, because it already owns the cache line exclusively. no extra message for invalidation is sent as the cache line is exclusive to CPU0
    7. CPU1 receives the shutdown value and commits it to its local cache, proceeding to the next line.
    8. CPU1 checks the failure value for the if statement, but since the invalidate queue (LOB note) is not yet processed, it uses the value 0 from its local cache (does not enter if block).
    9. CPU1 processes the invalidate queue and updates failure to 1, but it is already too late...

    What we refer to as load-order buffers is actually the queueing of invalidation requests, and the above can be fixed with:

    // CPU 0:
    void shutDownWithFailure(void)
    {
      failure = 1; // must use SOB as this is owned by CPU 1
      SFENCE // next instruction will execute after all SOBs are processed
      shutdown = 1; // can execute immediately as it is owned by CPU 0
    }
    // CPU1:
    void workLoop(void)
    {
      while (shutdown == 0) { ... }
      LFENCE // next instruction will execute after all LOBs are processed
      if (failure) { ...}
    }
    
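    The two fixed routines above can be sketched in Java with the Unsafe fences, which is what the question is ultimately about. This is only an illustration: the class name and fields are mine, and for determinism main() runs both routines sequentially on one thread. A real concurrent run would use two threads, where the plain (non-volatile) spin loop could additionally be hoisted by the JIT, so do not treat this as a production pattern:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FencedShutdown {
    static final Unsafe U;
    static {
        try {
            // unsupported singleton, reachable via reflection on HotSpot JDKs
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int failure = 0;   // plain fields, ordered manually with fences
    static int shutdown = 0;

    // the role of CPU 0 in the example above
    static void shutDownWithFailure() {
        failure = 1;
        U.storeFence();       // plays the SFENCE role from the discussion above
        shutdown = 1;
    }

    // the role of CPU 1 in the example above
    static void workLoop() {
        while (shutdown == 0) { /* spin */ }
        U.loadFence();        // plays the LFENCE role from the discussion above
        if (failure == 1) {
            System.out.println("failure handled");
        }
    }

    public static void main(String[] args) {
        shutDownWithFailure(); // sequential on purpose, see note above
        workLoop();
    }
}
```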

    Your question on x86

    Now that you know what SOBs/LOBs do, think about the combinations you mentioned:

    loadFence() becomes load_loadstoreFence();
    

    No, a load fence waits for LOBs to be processed, essentially emptying the invalidation queue. This means that all subsequent loads will see up-to-date data (no reordering), as they will be fetched from the cache sub-system (which is coherent). Stores CANNOT be reordered with subsequent loads, because they do not go through the LOB (and furthermore, store-forwarding takes care of locally modified cache lines). From the perspective of THIS particular core (the one executing the load fence), a store that follows the load fence will execute AFTER all registers have the data loaded. There is no way around it.

    load_storeFence() becomes ???
    

    There is no need for a load_storeFence, as it does not make sense. To store something, you must calculate it using input. To fetch input, you must execute loads. The stores will occur using the data fetched from loads. If you want to make sure you see up-to-date values from all OTHER processors when loading, use a loadFence. For loads after the fence, store-forwarding takes care of consistent ordering.

    All other cases are similar.


    SPARC

    SPARC is even more flexible and can reorder stores with subsequent loads (and loads with subsequent stores). I was not as familiar with SPARC, so my GUESS was that there is no store-forwarding (SOBs are not consulted when reloading an address), so "dirty reads" would be possible. In fact I was wrong: I found the SPARC architecture in [3], and the reality is that store-forwarding is per-thread. From section 5.3.4:

    All loads check the store buffer (same thread only) for read after write (RAW) hazards. A full RAW occurs when the dword address of the load matches that of a store in the STB and all bytes of the load are valid in the store buffer. A partial RAW occurs when the dword addresses match, but all bytes are not valid in the store buffer. (Ex., a ST (word store) followed by an LDX (dword load) to the same address results in a partial RAW, because the full dword is not in the store buffer entry.)

    So, different threads consult different store-order buffers, hence the possibility of dirty reads after stores.


    References

    [1] Memory Barriers: a Hardware View for Software Hackers, Linux Technology Center, IBM Beaverton http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

    [2] Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

    [3] OpenSPARC T2 Core Microarchitecture Specification http://www.oracle.com/technetwork/systems/opensparc/t2-06-opensparct2-core-microarch-1537749.html
