Java 8 Unsafe: xxxFence() instructions


Question

In Java 8, three memory barrier instructions were added to the Unsafe class (source):

/**
 * Ensures lack of reordering of loads before the fence
 * with loads or stores after the fence.
 */
void loadFence();

/**
 * Ensures lack of reordering of stores before the fence
 * with loads or stores after the fence.
 */
void storeFence();

/**
 * Ensures lack of reordering of loads or stores before the fence
 * with loads or stores after the fence.
 */
void fullFence();
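
For context, a minimal sketch of how these methods can be reached from application code on a Java 8 HotSpot JVM (the reflective theUnsafe access is the usual workaround for Unsafe.getUnsafe() being restricted to boot-classpath callers; the class name and demo method here are illustrative, not part of the original post):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class FenceAccess {
    static final Unsafe U = unsafe();

    private static Unsafe unsafe() {
        try {
            // Unsafe.getUnsafe() throws SecurityException for ordinary
            // application code, so read the singleton field reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void demo() {
        U.loadFence();  // loads before this call are not reordered with later loads/stores
        U.storeFence(); // stores before this call are not reordered with later loads/stores
        U.fullFence();  // neither loads nor stores before this call are reordered with later ones
    }
}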

If we define a memory barrier in the following way (which I consider more or less easy to understand):

Consider X and Y to be operation types/classes that are subject to reordering,

X_YFence() is a memory barrier instruction that ensures that all operations of type X before the barrier are completed before any operation of type Y after the barrier is started.

We can now "map" barrier names from Unsafe to this terminology:

  • loadFence() becomes load_loadstoreFence();
  • storeFence() becomes store_loadStoreFence();
  • fullFence() becomes loadstore_loadstoreFence();

Finally, my question is: why don't we have load_storeFence(), store_loadFence(), store_storeFence() and load_loadFence()?

My guess would be that they are not really necessary, but I do not understand why at the moment. So, I'd like to know the reasons for not adding them. Guesses about that are welcome too (though I hope this doesn't cause the question to be closed as opinion-based and off-topic).

Thanks in advance.

Answer

Summary

CPU cores have special memory-ordering buffers to assist them with out-of-order execution. These can be (and typically are) separate for loads and stores: LOBs for load-order buffers and SOBs for store-order buffers.

The fencing operations chosen for the Unsafe API were selected based on the following assumption: the underlying processors will have separate load-order buffers (for reordering loads) and store-order buffers (for reordering stores).

Therefore, based on this assumption, from a software point of view, you can request one of three things from the CPU:

  1. Empty the LOBs (loadFence): no other instructions will start executing on this core until all entries of the LOBs have been processed. In x86 this is LFENCE.
  2. Empty the SOBs (storeFence): no other instructions will start executing on this core until all entries of the SOBs have been processed. In x86 this is SFENCE.
  3. Empty both the LOBs and the SOBs (fullFence): both of the above. In x86 this is MFENCE.

In reality, each specific processor architecture provides different memory-ordering guarantees, which may be more stringent or more flexible than the above. For example, the SPARC architecture can reorder load-store and store-load sequences, whereas x86 will not do that. Furthermore, architectures exist where the LOBs and SOBs cannot be controlled individually (i.e. only a full fence is possible). In both cases, however:

  • when the architecture is more flexible, the API simply does not provide access to the "laxer" sequencing combinations, as a matter of choice;

  • when the architecture is more stringent, the API simply implements the more stringent sequencing guarantee in all cases (e.g. all 3 calls actually end up being implemented as a full fence).

The reason for the particular API choices is explained in the JEP, as per the answer assylias provided, which is 100% on the spot. If you know about memory ordering and cache coherence, assylias' answer should suffice. I think the fact that they match the standardized instructions in the C++ API was a major factor (it simplifies the JVM implementation a lot): http://en.cppreference.com/w/cpp/atomic/memory_order In all likelihood, the actual implementation calls into the respective C++ API instead of using some special instruction.

Below I give a detailed explanation with x86-based examples, which provides all the context necessary to understand these things. In fact, the demarcated section below actually answers another question: "Can you provide basic examples of how memory fences work to control cache coherence in the x86 architecture?"

The reason for this is that I myself (coming from a software-developer rather than a hardware-designer background) had trouble understanding what memory reordering is, until I learned specific examples of how cache coherence actually works in x86. This provides invaluable context for discussing memory fences in general (for other architectures as well). At the end I discuss SPARC a bit, using the knowledge gained from the x86 examples.

Reference [1] is an even more detailed explanation and has a separate section discussing each of x86, SPARC, ARM and PowerPC, so it is an excellent read if you are interested in more details.

x86 provides 3 types of fencing instructions: LFENCE (load fence), SFENCE (store fence) and MFENCE (load-store fence), so it maps 100% to the Java API.

This is because x86 has separate load-order buffers (LOBs) and store-order buffers (SOBs), so indeed the LFENCE/SFENCE instructions apply to the respective buffer, whereas MFENCE applies to both.

SOBs are used to store an outgoing value (from processor to cache system) while the cache-coherence protocol works to acquire permission to write to the cache line. LOBs are used to store invalidation requests so that invalidation can execute asynchronously (this reduces stalling on the receiving side, in the hope that the code executing there will not actually need that value).

Suppose you have a dual-processor system with two CPUs, 0 and 1, executing the routines below. Consider the case where the cache line holding failure is initially owned by CPU 1, whereas the cache line holding shutdown is initially owned by CPU 0.

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  shutdown = 1; // can execute immediately as it is owned by CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

In the absence of a store fence, CPU 0 may signal a shutdown due to failure, but CPU 1 will exit the loop and NOT go into the failure-handling if block.

This is because CPU 0 will write the value 1 for failure to a store-order buffer, also sending out a cache-coherence message to acquire exclusive access to the cache line. It will then proceed to the next instruction (while waiting for exclusive access) and update the shutdown flag immediately (this cache line is owned exclusively by CPU 0 already, so there is no need to negotiate with other cores). Finally, when it later receives the invalidation confirmation message from CPU 1 (regarding failure), it will proceed to process the SOB for failure and write the value to the cache (but the order is by now reversed).

Inserting a storeFence() will fix things:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned by CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

A final aspect that deserves mention is that x86 has store forwarding: when a CPU writes a value which gets stuck in an SOB (due to cache coherence), it may subsequently attempt to execute a load instruction for the same address BEFORE the SOB is processed and delivered to the cache. CPUs will therefore consult the SOBs PRIOR to accessing the cache, so the value retrieved in this case is the last-written value from the SOB. This means that stores from THIS core can never be reordered with subsequent loads from THIS core, no matter what.
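
A trivial Java-level illustration of that last point (nothing here is from the original answer, and the field name is made up): within one thread, a load that follows a store to the same variable always observes that store; on x86 this is exactly what store forwarding guarantees at the hardware level, even while the SOB entry has not yet reached the cache:

static int flag = 0;

static int sameCoreReadAfterWrite() {
    flag = 1;     // the store may still be sitting in an SOB
    return flag;  // store forwarding satisfies this load from the SOB, so this returns 1
}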

Now, assume you have the store fence in place and are happy that shutdown cannot overtake failure on its way to CPU 1, and focus on the other side. Even in the presence of the store fence, there are scenarios where the wrong thing happens. Consider the case where failure is in both caches (shared), whereas shutdown is only present in, and owned exclusively by, the cache of CPU 0. Bad things can happen as follows:

  1. CPU 0 writes 1 to failure; it also sends a message to CPU 1 to invalidate its copy of the shared cache line, as part of the cache-coherence protocol.
  2. CPU 0 executes the SFENCE and stalls, waiting for the SOB used for failure to commit.
  3. CPU 1 checks shutdown due to the while loop and (realizing it is missing the value) sends a cache-coherence message to read the value.
  4. CPU 1 receives the message from CPU 0 in step 1 to invalidate failure, and sends an immediate acknowledgement for it. NOTE: this is implemented using the invalidation queue, so in fact it simply enters a note (allocates an entry in its LOB) to do the invalidation later, but does not actually perform it before sending out the acknowledgement.
  5. CPU 0 receives the acknowledgement for failure and proceeds past the SFENCE to the next instruction.
  6. CPU 0 writes 1 to shutdown without using an SOB, because it already owns the cache line exclusively. No extra invalidation message is sent, as the cache line is exclusive to CPU 0.
  7. CPU 1 receives the shutdown value and commits it to its local cache, proceeding to the next line.
  8. CPU 1 checks the failure value for the if statement, but since the invalidation queue (the LOB note) has not yet been processed, it uses the value 0 from its local cache (it does not enter the if block).
  9. CPU 1 processes the invalidation queue and updates failure to 1, but it is already too late...

What we refer to as load-order buffers is actually the queueing of invalidation requests, and the above can be fixed with:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned by CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  LFENCE // next instruction will execute after all LOBs are processed
  if (failure) { ...}
}
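
For reference, here is the same fixed pattern expressed at the Java 8 level with the Unsafe fences from the question (only a sketch: U is the Unsafe instance obtained as in the earlier access example, the flags are plain int fields, and the busy-wait ignores the fact that a JIT may hoist the plain read of shutdown out of the loop in real code):

// shared, plain (non-volatile) fields, both initially 0
static int failure = 0;
static int shutdown = 0;

// writer ("CPU 0"):
static void shutDownWithFailure() {
    failure = 1;
    U.storeFence();  // plays the role of SFENCE: drain the SOBs before the shutdown store
    shutdown = 1;
}

// reader ("CPU 1"):
static void workLoop() {
    while (shutdown == 0) { /* spin */ }
    U.loadFence();   // plays the role of LFENCE: drain the LOBs before reading failure
    if (failure == 1) { /* handle the failure */ }
}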

Your question on x86

Now that you know what SOBs/LOBs do, think about the combinations you mentioned:

loadFence() becomes load_loadstoreFence();

No, a load fence waits for the LOBs to be processed, essentially emptying the invalidation queue. This means that all subsequent loads will see up-to-date data (no reordering), as they will be fetched from the cache subsystem (which is coherent). Stores CANNOT be reordered with subsequent loads, because they do not go through the LOBs (and, furthermore, store forwarding takes care of locally-modified cache lines). From the perspective of THIS particular core (the one executing the load fence), a store that follows the load fence will execute AFTER all registers have the data loaded. There is no way around it.

load_storeFence() becomes ???

There is no need for a load_storeFence, as it does not make sense. To store something you must calculate it using input. To fetch input you must execute loads. The stores will occur using the data fetched from the loads. If you want to make sure you see up-to-date values from all OTHER processors when loading, use a loadFence. For loads after the fence, store forwarding takes care of consistent ordering.

All other cases are similar.

SPARC is even more flexible and can reorder stores with subsequent loads (and loads with subsequent stores). I was not as familiar with SPARC, so my GUESS was that there is no store forwarding (SOBs are not consulted when reloading an address), so "dirty reads" would be possible. In fact I was wrong: I found the SPARC architecture in [3], and the reality is that store forwarding is threaded (per hardware thread). From section 5.3.4:

All loads check the store buffer (same thread only) for read after write (RAW) hazards. A full RAW occurs when the dword address of the load matches that of a store in the STB and all bytes of the load are valid in the store buffer. A partial RAW occurs when the dword addresses match, but all bytes are not valid in the store buffer. (Ex., a ST (word store) followed by an LDX (dword load) to the same address results in a partial RAW, because the full dword is not in the store buffer entry.)

So, different threads consult different store-order buffers, hence the possibility of dirty reads after stores.

[1] Memory Barriers: a Hardware View for Software Hackers, Linux Technology Center, IBM Beaverton. http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

[2] Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[3] OpenSPARC T2 Core Microarchitecture Specification. http://www.oracle.com/technetwork/systems/opensparc/t2-06-opensparct2-core-microarch-1537749.html
