x86 MESI invalidate cache line latency issue


Question

I have the following processes. I am trying to make ProcessB very low latency, so I use a tight loop all the time and isolate CPU core 2.

Global variables in shared memory:

int bDOIT ;
typedef struct XYZ_ {
    int field1 ;
    int field2 ;
    .....
    int field20;
}  XYZ;
XYZ glbXYZ ; 

static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
} 

ProcessA (on core 1):

while(1){
    nonblocking_recv(fd,&iret);
    if( errno == EAGAIN)
        continue ; 
    if( iret == 1 )
        bDOIT = 1 ;
    else
        bDOIT = 0 ;
 } // while

ProcessB (on core 2):

while(1){
    escape(&bDOIT) ;
    if( bDOIT ){
        memcpy(localxyz,glbXYZ) ; // ignore lock issue 
        doSomething(localxyz) ;
    }
} //while 

ProcessC (on core 3):

while(1){
     usleep(1000) ;
     glbXYZ.field1 = xx ;
     glbXYZ.field2 = xxx ;
     ....
     glbXYZ.field20 = xxxx ;  
} //while

In these simple pseudo-code processes, when ProcessA modifies bDOIT to 1, it will invalidate the cache line in core 2; then, after ProcessB sees bDOIT=1, ProcessB will do the memcpy(localxyz,glbXYZ).

Since every 1000 usec ProcessC will invalidate glbXYZ in core 2, I guess this will affect the latency when ProcessB tries to do memcpy(localxyz,glbXYZ), because by the time ProcessB sees bDOIT become 1, glbXYZ has already been invalidated by ProcessC:

the new value of glbXYZ is still in core 3's L1d or L2. After ProcessB actually gets bDOIT=1, core 2 knows its copy of glbXYZ is invalid, so it requests the new value of glbXYZ at that moment, and ProcessB's latency suffers from waiting for the new value of glbXYZ.

My question:

If I have a ProcessD (on core 4) which does:

while(1){
    usleep(10);
    memcpy(nouseXYZ,glbXYZ);
 } //while 

will this ProcessD get glbXYZ flushed to L3 earlier, so that when ProcessB on core 2 finds its glbXYZ invalidated and asks for the new value, ProcessD will have helped ProcessB get glbXYZ sooner, since ProcessD keeps pulling glbXYZ into L3?

Answer

Interesting idea, yeah that should probably get the cache line holding your struct into a state in L3 cache where core#2 can get an L3 hit directly, instead of having to wait for a MESI read request while the line is still in M state in the L1d of core#3.

Or if ProcessD is running on the other logical core of the same physical core as ProcessB, data will be fetched into the right L1d. If it spends most of its time asleep (and wakes up infrequently), ProcessB will still usually have the whole CPU to itself, running in single-thread mode without partitioning the ROB and store buffer.

Instead of having the dummy-access thread spinning on usleep(10), you could have it wait on a condition variable or a semaphore that ProcessC pokes after writing glbXYZ.

With a counting semaphore (like POSIX C semaphores sem_wait/sem_post), the thread that writes glbXYZ can increment the semaphore, triggering the OS to wake up ProcessD which is blocked in sem_wait. If for some reason ProcessD misses its turn to wake up, it will do 2 iterations before it blocks again, but that's fine. (Hmm, so actually we don't need a counting semaphore, but I think we do want OS-assisted sleep/wake and this is an easy way to get it, unless we need to avoid the overhead of a system call in ProcessC after writing the struct.) Or a raise() system call in ProcessC could send a signal to trigger wakeup of ProcessD.

With Spectre+Meltdown mitigation, any system call, even an efficient one like Linux futex is fairly expensive for the thread making it. This cost isn't part of the critical path that you're trying to shorten, though, and it's still much less than the 10 usec sleep you were thinking of between fetches.

void ProcessD(void) {
    while(1){
        sem_wait(something);          // allows one iteration to run per sem_post
        __builtin_prefetch (&glbXYZ, 0, 1);  // PREFETCHT2 into L2 and L3 cache
    }
}
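For completeness, the writer side described above might look like the following sketch. This is illustrative, not the original code: glbXYZ_sem and publish_xyz are made-up names, and a real version would also protect the struct from tearing (e.g. with a SeqLock).

```c
// Illustrative writer side for ProcessC: update the struct, then sem_post so
// the OS wakes the ProcessD thread blocked in sem_wait.
// glbXYZ_sem and publish_xyz are hypothetical names for this sketch.
#include <semaphore.h>

typedef struct { int field1, field2; /* ... field20 */ } XYZ;

XYZ   glbXYZ;
sem_t glbXYZ_sem;            // counts pending "glbXYZ was updated" events

void publish_xyz(int v1, int v2) {
    glbXYZ.field1 = v1;      // plain stores; real code wants a SeqLock or
    glbXYZ.field2 = v2;      // atomics so readers can't see a torn struct
    sem_post(&glbXYZ_sem);   // wake ProcessD so it can prefetch the dirty line
}
```

ProcessD then loops on sem_wait(&glbXYZ_sem) followed by the prefetch, as in the snippet above.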

(According to Intel's optimization manual section 7.3.2, PREFETCHT2 on current CPUs is identical to PREFETCHT1, and fetches into L2 cache, and L3 along the way. I didn't check AMD. See also: What level of the cache does PREFETCHT2 fetch into?)

I haven't tested that PREFETCHT2 will actually be useful here on Intel or AMD CPUs. You might want to use a dummy volatile access like *(volatile char*)&glbXYZ; or *(volatile int*)&glbXYZ.field1. Especially if you have ProcessD running on the same physical core as ProcessB.

If prefetchT2 works, you could maybe do that in the thread that writes bDOIT (ProcessA), so it could trigger the migration of the line to L3 right before ProcessB will need it.

If you're finding that the line gets evicted before use, maybe you do want a thread spinning on fetching that cache line.

On future Intel CPUs, there's a cldemote instruction (_cldemote(const void*)) which you could use after writing to trigger migration of the dirty cache line to L3. It runs as a NOP on CPUs that don't support it, but it's only slated for Tremont (Atom) so far. (Along with umonitor/umwait to wake up when another core writes in a monitored range from user-space, which would probably also be super-useful for low latency inter-core stuff.)

Since ProcessA doesn't write the struct, you should probably make sure bDOIT is in a different cache line than the struct. You might put alignas(64) on the first member of XYZ so the struct starts at the start of a cache line. alignas(64) atomic<int> bDOIT; would make sure it was also at the start of a line, so they can't share a cache line. Or make it an alignas(64) atomic<bool> or atomic_flag.

Also see Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size1 : normally 128 is what you want to avoid false sharing because of adjacent-line prefetchers, but it's actually not a bad thing if ProcessB triggers the L2 adjacent-line prefetcher on core#2 to speculatively pull glbXYZ into its L2 cache when it's spinning on bDOIT. So you might want to group those into a 128-byte aligned struct if you're on an Intel CPU.
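A minimal C11 sketch of that 128-byte grouping (the wrapper struct and its name are assumptions, not from the question): the flag gets the first 64-byte line and the payload gets the second, so they never false-share, while the adjacent-line prefetcher can still pull both lines as a pair.

```c
// Sketch: flag and payload in one 128-byte-aligned container, one 64-byte
// cache line each. Member names mirror the question; shared_pair is made up.
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int field1, field2, field3, field4; /* ... */ } XYZ;

struct shared_pair {
    alignas(128) _Atomic int bDOIT;       // line 0 of the aligned pair
    char pad[64 - sizeof(_Atomic int)];   // pad out the rest of line 0
    XYZ glbXYZ;                           // starts exactly at line 1 (offset 64)
};
```

offsetof(struct shared_pair, glbXYZ) comes out to 64 and the whole struct is 128-byte aligned, which is the property the paragraph above is after.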

And/or you might even use a software prefetch if bDOIT is false, in processB. A prefetch won't block waiting for the data, but if the read request arrives in the middle of ProcessC writing glbXYZ then it will make that take longer. So maybe only SW prefetch every 16th or 64th time bDOIT is false?

And don't forget to use _mm_pause() in your spin loop, to avoid a memory-order mis-speculation pipeline nuke when the branch you're spinning on goes the other way. (Normally this is a loop-exit branch in a spin-wait loop, but that's irrelevant. Your branching logic is equivalent to outer infinite loop containing a spin-wait loop and then some work, even though that's not how you've written it.)

Or possibly use lock cmpxchg instead of a pure load to read the old value. Full barriers already block speculative loads after the barrier, so prevent mis-speculation. (You can do this in C11 with atomic_compare_exchange_weak with expected = desired. It takes expected by reference, and updates it if the compare fails.) But hammering on the cache line with lock cmpxchg is probably not helpful to ProcessA being able to commit its store to L1d quickly.
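In C11 that load-via-CAS trick might look like the following sketch (the helper name is hypothetical; on x86 both calls compile to locked RMW instructions, which are full barriers):

```c
// Sketch: read a flag with a full-barrier RMW instead of a plain load.
// With expected == desired, a successful CAS rewrites the same value, and a
// failed CAS copies the current value into 'expected', so either way
// 'expected' ends up holding the value that was in *flag.
#include <stdatomic.h>

int read_with_full_barrier(_Atomic int *flag) {
    int expected = 1;
    atomic_compare_exchange_weak(flag, &expected, 1);  // lock cmpxchg on x86
    return expected;
    // equivalent alternative: return atomic_fetch_add(flag, 0);  // lock xadd
}
```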

Check the machine_clears.memory_ordering perf counter to see if this is happening without _mm_pause. If it is, then try _mm_pause first, and then maybe try using atomic_compare_exchange_weak as a load. Or atomic_fetch_add(&bDOIT, 0), because lock xadd would be equivalent.

// GNU C11.  The typedef in your question looks like C, redundant in C++, so I assumed C.

#include <immintrin.h>
#include <stdatomic.h>
#include <stdalign.h>

alignas(64) atomic_bool bDOIT;
typedef struct { int a,b,c,d;       // 16 bytes
                 int e,f,g,h;       // another 16
} XYZ;
alignas(64) XYZ glbXYZ;

extern void doSomething(XYZ);

// just one object (of arbitrary type) that might be modified
// maybe cheaper than a "memory" clobber (compile-time memory barrier)
#define MAYBE_MODIFIED(x) asm volatile("": "+g"(x))

// suggested ProcessB
void ProcessB(void) {
    int prefetch_counter = 32;  // local that doesn't escape
    while(1){
        if (atomic_load_explicit(&bDOIT, memory_order_acquire)){
            MAYBE_MODIFIED(glbXYZ);
            XYZ localxyz = glbXYZ;    // or maybe a seqlock_read
  //        MAYBE_MODIFIED(glbXYZ);  // worse code from clang, but still good with gcc, unlike a "memory" clobber which can make gcc store localxyz separately from writing it to the stack as a function arg

  //          asm("":::"memory");   // make sure it finishes reading glbXYZ instead of optimizing away the copy and doing it during doSomething
            // localxyz hasn't escaped the function, so it shouldn't be spilled because of the memory barrier
            // but if it's too big to be passed in RDI+RSI, code-gen is in practice worse
            doSomething(localxyz);
        } else {

            if (0 == --prefetch_counter) {
                // not too often: don't want to slow down writes
                __builtin_prefetch(&glbXYZ, 0, 3);  // PREFETCHT0 into L1d cache
                prefetch_counter = 32;
            }

            _mm_pause();       // avoids memory order mis-speculation on bDOIT
                               // probably worth it for latency and throughput
                               // even though it pauses for ~100 cycles on Skylake and newer, up from ~5 on earlier Intel.
        }

    }
}

This compiles nicely on Godbolt to pretty nice asm. If bDOIT stays true, it's a tight loop with no overhead around the call. clang7.0 even uses SSE loads/stores to copy the struct to the stack as a function arg 16 bytes at a time.

Obviously the question is a mess of undefined behaviour which you should fix with _Atomic (C11) or std::atomic (C++11) with memory_order_relaxed. Or mo_release / mo_acquire. You don't have any memory barrier in the function that writes bDOIT, so it could sink that store out of the loop. Making it atomic with memory_order_relaxed has literally zero downside for the quality of the asm.

Presumably you're using a SeqLock or something to protect glbXYZ from tearing. Yes, asm("":::"memory") should make that work by forcing the compiler to assume it's been modified asynchronously. The "g"(glbXYZ) input to the asm statement is useless, though. It's global, so the "memory" barrier already applies to it (because the asm statement could already reference it). If you want to tell the compiler that just it could have changed, use asm volatile("" : "+g"(glbXYZ)); without a "memory" clobber.

Or in C (not C++), just make it volatile and do struct assignment, letting the compiler pick how to copy it, without using barriers. In C++, foo x = y; fails for volatile foo y; where foo is an aggregate type like a struct. volatile struct = struct not possible, why?. This is annoying when you want to use volatile to tell the compiler that data may change asynchronously as part of implementing a SeqLock in C++, but you still want to let the compiler copy it as efficiently as possible in arbitrary order, not one narrow member at a time.
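In C that volatile whole-struct copy is legal and might be sketched like this (names mirror the question; the read side of a real SeqLock would also retry on a sequence-number mismatch):

```c
// Sketch: C (not C++) allows struct assignment from a volatile-qualified
// object, so the compiler chooses how to copy it, with no explicit barriers.
typedef struct { int field1, field2, field3, field4; /* ... */ } XYZ;

volatile XYZ glbXYZ;

XYZ snapshot_xyz(void) {
    XYZ local = glbXYZ;   // whole-struct copy; compiler picks the access pattern
    return local;         // C++ would reject this initialization
}
```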

Footnote 1: C++17 specifies std::hardware_destructive_interference_size as an alternative to hard-coding 64 or making your own CLSIZE constant, but gcc and clang don't implement it yet because it becomes part of the ABI if used in an alignas() in a struct, and thus can't actually change depending on actual L1d line size.
