Acquire/release semantics with non-temporal stores on x64


Question


I have something like:

if (f = acquire_load()) {
   ... use Foo
}

and:

auto f = new Foo();
release_store(f);

You could easily imagine an implementation of acquire_load and release_store that uses atomic with load(memory_order_acquire) and store(memory_order_release). But now what if release_store is implemented with _mm_stream_si64, a non-temporal write, which is not ordered with respect to other stores on x64? How to get the same semantics?

I think the following is the minimum required:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_relaxed);
}

void release_store(Foo* f) {
   _mm_stream_si64(*(Foo**)&gFoo, f);
}

And use it as so:

// thread 1
if (f = acquire_load()) {
   _mm_lfence(); 
   ... use Foo
}

and:

// thread 2
auto f = new Foo();
_mm_sfence(); // ensures Foo is constructed by the time f is published to gFoo
release_store(f);

Is that correct? I'm pretty sure the sfence is absolutely required here. But what about the lfence? Is it required, or would a simple compiler barrier be enough on x64? e.g. asm volatile("": : :"memory"). According to the x86 memory model, loads are not reordered with other loads. So to my understanding, acquire_load() must happen before any load inside the if statement, as long as there's a compiler barrier.

Solution

I might be wrong about some things in this answer (proof-reading welcome from people that know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.


First of all, using NT stores for a single pointer global variable is insane. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (And yes, movnt stores evict the cache line if it was in cache to start with, see vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data). Your function names also don't really reflect what you're doing.

I think it would be a lot more sane to do a bunch of NT stores (e.g. for a memset or memcpy type of thing), then an SFENCE, then a normal release_store: done_flag.store(1, std::memory_order_release).

I don't see how using a movnti store to the synchronization variable could possibly improve performance. The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read.

x86 hardware is extremely heavily optimized for doing release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.


movnt stores can be reordered with other stores, but not with older reads. Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families) says that

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads. (note the lack of exceptions).
  • Writes to memory are not reordered with other writes, with the following exceptions:
  • ... stuff about clflushopt and the fence instructions

Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order. So a StoreStore barrier (SFENCE) is needed, but it is also sufficient: the x86 memory model for WB memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier. E.g. std::atomic_signal_fence(std::memory_order_release), but you might as well just use a thread_fence (which won't emit any instructions for x86, but will make your code portable to other architectures with the _mm_ stuff taken out).

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
void release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  nvm, this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier
   _mm_sfence();  // make sure all writes to the locked region are already globally visible
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.


To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}

You didn't say anything about storing gFoo in WC memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire). Read up on this before using mo_consume in production code, and esp. be careful to note that testing it properly is impossible, because future compilers are expected to give weaker guarantees than current compilers do in practice.


Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions", which in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE.)

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but the x86 memory model table in Intel's manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.


NT loads

I know you didn't ask this, but I wrote this part before realizing you hadn't actually mentioned them. Before researching this, I wasn't sure what kind of reordering NT loads could have, so it's something I wanted to know.

In x86, every load has acquire semantics, except for loads from WC memory. SSE4.1 MOVNTDQA is the only non-temporal load instruction, and it isn't weakly ordered when used on normal (WriteBack) memory. So it's an acquire-load, too (when used on WB memory).

Note that movntdq only has a store form, while movntdqa only has a load form. But apparently Intel couldn't just call them storentdqa and loadntdqa. They both have a 16B or 32B alignment requirement, so leaving off the a doesn't make a lot of sense to me. I guess SSE1 and SSE2 had already introduced some NT stores using the mov... mnemonic (like movntps), but no loads until years later in SSE4.1. (2nd-gen Core2: 45nm Penryn).

The docs for MOVNTDQA say it doesn't change the ordering semantics for the memory type it's used on.

... An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.

My (untested) guess at how a uarch might implement it: insert the newly-loaded NT line into the cache at the LRU position, instead of at the usual MRU position. (See this article about IvB's adaptive L3 policy for a related idea.) So streaming-loads of a giant array might only pollute one "way" of set-associative caches. (TODO: test this theory!)


Also, if you are using it on WC memory (e.g. copying from video RAM, like in this Intel guide):

Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system.

That doesn't spell out how it should be used, though. Maybe only writers need to fence? And I'm not totally sure why they say MFENCE rather than SFENCE or LFENCE. Maybe they're talking about a write-to-device-memory, read-from-device-memory situation where stores have to be ordered with respect to loads (StoreLoad barrier), not just with each other (StoreStore barrier).

I searched in Vol3 for movntdqa, and didn't get any hits (in the whole pdf). 3 hits for movntdq: All the discussion of weak ordering and memory types only talks about stores. Note that LFENCE was introduced long before SSE4.1. Presumably it's useful for something, but IDK what. For load ordering, probably only with WC memory, but I haven't read up on when that would be useful.


See also: Non-temporal loads and the hardware prefetcher, do they work together?


LFENCE appears to be more than just a LoadLoad barrier for weakly-ordered loads: it orders other instructions too. (Not the global-visibility of stores, though, just their local execution).

From Intel's insn ref manual:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
...
Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

The entry for rdtsc suggests using LFENCE;RDTSC to prevent it from executing ahead of previous instructions, when RDTSCP isn't available (and the weaker ordering guarantee is ok: rdtscp doesn't stop following instructions from executing ahead of it). (CPUID is a common suggestion for serializing the instruction stream around rdtsc.)
