The right way to use function _mm_clflush to flush a large struct

Question

I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb.

Say I have defined a struct named mystruct whose size is 256 bytes. My cache line size is 64 bytes. Now I want to flush the cache lines that contain the mystruct variable. Which of the following is the right way to do so?

_mm_clflush(&mystruct);

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i*64);
}

Recommended answer

The clflush CPU instruction doesn't know the size of your struct; it only flushes exactly one cache line, the one containing the byte pointed to by the pointer operand. (The C intrinsic exposes this as a const void*, but char* would also make sense, especially given the asm documentation which describes it as an 8-bit memory operand.)

You need 4 flushes 64 bytes apart, or maybe 5 if your struct isn't alignas(64) so it could have parts in 5 different lines. (You could unconditionally flush the last byte of the struct, instead of using more complex logic to check if it's in a cache line you haven't flushed yet, depending on relative cost of clflush vs. more logic and a possible branch mispredict.)
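
To make the 4-or-5 count concrete, here is a small sketch that counts how many lines a byte range overlaps. The helper name lines_touched and the LINESIZE constant are made up for illustration, and 64 bytes is assumed rather than queried from the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64u   /* assumed cache-line size; not queried from the CPU */

/* Hypothetical helper: how many cache lines does [addr, addr+size) overlap? */
static size_t lines_touched(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first = addr / LINESIZE;               /* line index of the first byte */
    uintptr_t last  = (addr + size - 1) / LINESIZE;  /* line index of the last byte  */
    return (size_t)(last - first + 1);
}
```

A 256-byte object that starts on a line boundary overlaps 4 lines; shift the start by even one byte and the count becomes 5.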

Your original loop did 4 flushes of 4 adjacent bytes at the start of your struct.
It's probably easiest to use pointer increments so the casting is not mixed up with the critical logic.

// first attempt, a bit clunky:
    const int LINESIZE = 64;
    const char *lastbyte = (const char *)(&mystruct+1) - 1;
    const char *p = (const char *)&mystruct;   // declared outside the loop because it is used again below
    for ( ; p <= lastbyte ; p+=LINESIZE) {
         _mm_clflush( p );
    }
    // if mystruct is guaranteed aligned by 64, you're done.  Otherwise not:

    // check if next line to maybe flush contains the last byte of the struct; if not then it was already flushed.
    // (extra parentheses needed: == binds tighter than &)
    if( (((uintptr_t)p ^ (uintptr_t)lastbyte) & -LINESIZE) == 0 )
        _mm_clflush( lastbyte );

x^y is 1 in bit-positions where they differ. x & -LINESIZE discards the offset-within-line bits of the address, keeping only the line-number bits. So we can see if 2 addresses are in the same cache line or not with just XOR and TEST instructions. (Or clang optimizes that to a shorter cmp instruction).
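
As a standalone sketch of that same-line test, the XOR-and-mask check can be wrapped in a helper. The name same_line is hypothetical and a 64-byte line is assumed. The mask is written as ~(uintptr_t)(LINESIZE - 1), which is the same bit pattern as -LINESIZE when LINESIZE is a signed int of value 64.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINESIZE 64u   /* assumed cache-line size */

/* Hypothetical helper: true if a and b fall in the same cache line. */
static bool same_line(const void *a, const void *b)
{
    /* XOR leaves 1 bits exactly where the two addresses differ. */
    uintptr_t diff = (uintptr_t)a ^ (uintptr_t)b;
    /* Drop the low offset-within-line bits; any remaining 1 bit means a
       different line. The parentheses matter: == binds tighter than &. */
    return (diff & ~(uintptr_t)(LINESIZE - 1)) == 0;
}
```

Casting the mask to uintptr_t before inverting avoids a truncated 32-bit mask on 64-bit targets, which is what an unsigned 64u negated on its own would produce.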

Or rewrite it as a single loop, using that if logic as the termination condition:

I used a C++ struct foo &var reference so I could follow your &var syntax but still see how it compiles for a function taking a pointer arg. Adapting to C is straightforward.

/* I think this version is best: 
  * compact setup / small code-size
  * with no extra latency for the initial pointer
  * doesn't need to peel a final iteration
*/
inline
void flush_structfoo(struct foo &mystruct) {
    const int LINESIZE = 64;
    const char *p = (const char *)&mystruct;
    uintptr_t endline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) | (LINESIZE-1);
    // set the offset-within-line address bits to get the last byte 
    // of the cacheline containing the end of the struct.

    do {   // flush while p is in a cache line that contains any of the struct
         _mm_clflush( p );
          p += LINESIZE;
    } while(p <= (const char*)endline);
}

With GCC10.2 -O3 for x86-64, this compiles nicely (Godbolt)

flush_v3(foo&):
        lea     rax, [rdi+255]
        or      rax, 63
.L11:
        clflush [rdi]
        add     rdi, 64
        cmp     rdi, rax
        jbe     .L11
        ret

GCC doesn't unroll, and doesn't optimize any better if you use alignas(64) struct foo{...}; unfortunately. You might use if (alignof(mystruct) >= 64) { ... } to check if special handling is needed to let GCC optimize better, otherwise just use end = p + sizeof(mystruct); or end = (const char*)(&mystruct+1) - 1; or similar.

(In C, #include <stdalign.h> to get the alignas() and alignof() macros, spelled the same as in C++, rather than writing the ISO C11 _Alignas and _Alignof keywords directly.)
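
One way the C adaptation with an alignment check might look is sketched below. Everything named here is illustrative: the 256-byte struct foo layout, the FLUSH_LINE wrapper macro (a no-op off x86-64 so the sketch still builds elsewhere), and the choice to return the flush count so the behavior can be checked. It is not the code behind the Godbolt output above, and whether GCC actually generates better code for the aligned branch is not verified here.

```c
#include <assert.h>     /* static_assert (C11) */
#include <stdalign.h>   /* alignas / alignof macros for C11 */
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_clflush */
#define FLUSH_LINE(p) _mm_clflush(p)
#else
#define FLUSH_LINE(p) ((void)(p))   /* placeholder so the sketch builds elsewhere */
#endif

#define LINESIZE 64     /* assumed cache-line size */

/* Hypothetical example type: 256 bytes, forced to start on a line boundary. */
struct foo {
    alignas(LINESIZE) char data[256];
};
static_assert(sizeof(struct foo) == 256, "example assumes a 256-byte struct");

/* Flush every line holding part of *s; returns the number of flushes issued. */
static unsigned flush_foo(const struct foo *s)
{
    const char *p = (const char *)s;
    unsigned n = 0;

    if (alignof(struct foo) >= LINESIZE) {
        /* Aligned: the struct starts a line, so step straight to its end. */
        const char *end = p + sizeof(*s);
        for (; p < end; p += LINESIZE, n++)
            FLUSH_LINE(p);
    } else {
        /* General case: run to the last byte of the line holding the last byte. */
        uintptr_t endline = ((uintptr_t)s + sizeof(*s) - 1) | (LINESIZE - 1);
        do {
            FLUSH_LINE(p);
            p += LINESIZE;
            n++;
        } while ((uintptr_t)p <= endline);
    }
    return n;
}
```

Because alignof(struct foo) is a compile-time constant, the compiler can discard the branch that does not apply.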

Another alternative is this, but it's clunkier and takes more setup work.

    const int LINESIZE = 64;
    uintptr_t line = (uintptr_t)&mystruct & -LINESIZE;      // start of the first line
    uintptr_t lastline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) & -LINESIZE;  // start of the last line
    do {               // always at least one flush; works on small structs
         _mm_clflush( (void*)line );
          line += LINESIZE;
    } while(line <= lastline);   // <= so the line holding the last byte is flushed too

A struct that was 257 bytes would always touch exactly 5 cache lines, no checking needed. Or a 260-byte struct that's known to be aligned by 4. IDK if we can get GCC to optimize away the checks based on that.
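
Those two claims can be checked by brute force over every start offset the alignment permits. The helper below is a sketch with a made-up name, line_count_is_fixed, and it again assumes 64-byte lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LINESIZE 64u   /* assumed cache-line size */

/* Hypothetical helper: does an object of `size` bytes, aligned to `align`,
   overlap the same number of lines at every possible start offset?
   On success, *count receives that number. */
static bool line_count_is_fixed(size_t size, size_t align, size_t *count)
{
    assert(size > 0 && align > 0 && LINESIZE % align == 0);
    size_t first = (size - 1) / LINESIZE + 1;        /* count when starting at offset 0 */
    for (size_t off = align; off < LINESIZE; off += align) {
        size_t n = (off + size - 1) / LINESIZE + 1;  /* lines overlapped from this offset */
        if (n != first)
            return false;
    }
    *count = first;
    return true;
}
```

For 257 bytes at any alignment, and for 260 bytes aligned by 4, the helper reports a fixed count of 5. For 256 bytes with byte alignment it reports that the count varies, which is the 4-or-5 case discussed earlier.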
