Using time stamp counter and clock_gettime for cache miss


Question

As a follow-up to this topic, in order to calculate the memory miss latency, I wrote the following code using _mm_clflush, __rdtsc and _mm_lfence (it is based on the code from this question/answer).

As you can see in the code, I first load the array into the cache. Then I flush one element, so its cache line is evicted from all cache levels. I added _mm_lfence to preserve the ordering when compiling with -O3.

Next, I used the time stamp counter to measure the latency of reading array[0]. As you can see, between the two time stamps there are three instructions: two lfence and one read. So I have to subtract the lfence overhead. The last section of the code calculates that overhead.

At the end of the code, the overhead and the miss latency are printed. However, the result is not valid!

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
int main()
{
    int array[ 100 ];
    for ( int i = 0; i < 100; i++ )
            array[ i ] = i;
    uint64_t t1, t2, ov, diff;

    _mm_lfence();
    _mm_clflush( &array[ 0 ] );
    _mm_lfence();

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();

    diff = t2 - t1;
    printf( "diff is %lu\n", diff );

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "miss cycles is %lu\n", diff-ov );

    return 0;
}

However, the output is not valid:

$ gcc -O3 -o flush1 flush1.c
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 147
miss cycles is 14
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 154
miss cycles is 7
$ taskset -c 0 ./flush1
diff is 147
lfence overhead is 154
miss cycles is 18446744073709551609

Any ideas?

Next, I tried the clock_gettime function to calculate the miss latency, as below:

    _mm_lfence();
    _mm_clflush( &array[ 0 ] );
    _mm_lfence();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    clock_gettime(CLOCK_MONOTONIC, &end);
    diff = 1000000000 * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("miss elapsed time = %lu nanoseconds\n", diff);

The output is miss elapsed time = 578 nanoseconds. Is that reliable?

UPDATE 1:

Thanks to Peter and Hadi. To summarize the responses so far, I found out:

1- Unused variables are omitted in the optimization phase, and that was the reason for the weird values I saw in the output. Thanks to Peter's reply, there are some ways to fix that.

2- clock_gettime is not suitable for such a fine resolution; that function is meant for larger delays.
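One way to check point 2 on a given machine is to measure the smallest delta between two back-to-back clock_gettime calls. The sketch below is only an illustration: the helper names timespec_diff_ns and clock_gettime_floor_ns are hypothetical and not part of the original code. If the floor it reports is of the same order as a cache miss, a single miss cannot be resolved this way.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signed nanosecond difference (end - start) between two timespec values. */
static int64_t timespec_diff_ns(const struct timespec *start,
                                const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Smallest delta seen between two back-to-back clock_gettime calls.
 * This is a floor on what the clock can resolve, call overhead included. */
static int64_t clock_gettime_floor_ns(int iters)
{
    assert(iters > 0);
    int64_t best = INT64_MAX;
    for (int i = 0; i < iters; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        int64_t d = timespec_diff_ns(&a, &b);
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling clock_gettime_floor_ns(1000) and printing the result gives a rough lower bound, in nanoseconds, on any interval measured with this clock.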

As a workaround, I tried to bring the array into the cache and then flush all elements, to be sure that every element is evicted from all cache levels. Then I measured the latency of array[0] and then array[20]. Since each element is 4 bytes, the distance is 80 bytes, so I expected to get two cache misses. However, the latency of array[20] is similar to a cache hit. A safe guess is that the cache line is not 80 bytes, so maybe array[20] is prefetched by hardware. Not always, though: I also see some odd results again.

    for ( int i = 0; i < 100; i++ ) {
            _mm_lfence();
            _mm_clflush( &array[ i ] );
            _mm_lfence();
    }

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff1 = t2 - t1;
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    _mm_lfence();
    t1 = __rdtsc();
    tmp = array[ 20 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff2 = t2 - t1;
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "TSC1 is %lu\n", diff1-ov );
    printf( "TSC2 is %lu\n", diff2-ov );

The output is:

$ ./flush1
tmp is 0
diff1 is 371
tmp is 20
diff2 is 280
lfence overhead is 147
TSC1 is 224
TSC2 is 133
$ ./flush1
tmp is 0
diff1 is 399
tmp is 20
diff2 is 280
lfence overhead is 154
TSC1 is 245
TSC2 is 126
$ ./flush1
tmp is 0
diff1 is 392
tmp is 20
diff2 is 840
lfence overhead is 147
TSC1 is 245
TSC2 is 693
$ ./flush1
tmp is 0
diff1 is 364
tmp is 20
diff2 is 140
lfence overhead is 154
TSC1 is 210
TSC2 is 18446744073709551602

The statement that "HW prefetcher brings other blocks" is about 80% correct then. What is going on? Is there a more accurate statement?

Recommended answer

You broke Hadi's code by removing the read of tmp at the end, so it gets optimized away by gcc. There is no load in your timed region. C statements are not asm instructions.

Look at the compiler-generated asm, e.g. on the Godbolt compiler explorer. You should always do this when you're trying to microbenchmark really low-level stuff like this, especially if your timing results are unexpected.

    lfence
    clflush [rcx]
    lfence

    lfence
    rdtsc                     # start of first timed region
    lfence
       # nothing because tmp=array[0] optimized away.
    lfence
    mov     rcx, rax
    sal     rdx, 32
    or      rcx, rdx
    rdtsc                     # end of first timed region
    mov     edi, OFFSET FLAT:.LC2
    lfence

    sal     rdx, 32
    or      rax, rdx
    sub     rax, rcx
    mov     rsi, rax
    mov     rbx, rax
    xor     eax, eax
    call    printf
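
The listing also accounts for the huge "miss cycles" value in the question. Both timed regions contain only fences, so diff and ov differ by noise alone, and whenever ov is larger the uint64_t subtraction wraps around. A minimal sketch of that arithmetic, using the numbers from the question's output (the helper names are hypothetical):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Difference of two cycle counts as a signed value, so that
 * "measured < overhead" shows up as a small negative number
 * instead of wrapping to nearly 2^64. */
static int64_t signed_delta(uint64_t measured, uint64_t overhead)
{
    return (int64_t)(measured - overhead);
}

/* Same message as the question's printf, with a signed format. */
static void print_miss_cycles(uint64_t diff, uint64_t ov)
{
    printf("miss cycles is %" PRId64 "\n", signed_delta(diff, ov));
}

/* The third run in the question: diff = 147, ov = 154. */
static void check_wraparound(void)
{
    uint64_t diff = 147, ov = 154;
    assert(diff - ov == 18446744073709551609ULL); /* 2^64 - 7 */
    assert(signed_delta(diff, ov) == -7);
}
```

Printing the signed value does not repair the measurement; it only makes it obvious that the two regions cost about the same.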

You get a compiler warning about an unused variable from -Wall, but you can silence that in ways that still let the load be optimized away. For example, your tmp++ doesn't make tmp available to anything outside the function, so it still optimizes away. Silencing the warning is not sufficient: print the value, return the value, or assign it to a volatile variable outside the timed region. (Or use inline asm volatile to require the compiler to have it in a register at some point. Chandler Carruth's CppCon2015 talk about using perf mentions some tricks: https://www.youtube.com/watch?v=nXaxk27zwlk)

In GNU C (at least with gcc and clang -O3), you can force a read by casting to (volatile int*), like this:

// int tmp = array[0];           // replace this
(void) *(volatile int*)array;    // with this

The (void) is to avoid a warning for evaluating an expression in a void context, like writing x;.

This kind of looks like strict-aliasing UB, but my understanding is that gcc defines this behaviour. The Linux kernel casts a pointer to add a volatile qualifier in its ACCESS_ONCE macro, so it's used in one of the codebases that gcc definitely cares about supporting. You could always make the whole array volatile; it doesn't matter if its initialization can't auto-vectorize.
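
As a rough illustration of both options, the cast can be wrapped in a small helper, or the array itself can be declared volatile. The names force_load and varray are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Perform exactly one real load from *p. The volatile-qualified access
 * keeps gcc and clang from deleting the load or merging it with another
 * one, even if the caller ignores the returned value. */
static inline int force_load(const int *p)
{
    assert(p != NULL);
    return *(const volatile int *)p;
}

/* Alternative from the text: make the whole array volatile, so every
 * access to it in the source becomes a real load or store in the asm. */
static volatile int varray[100];

static void init_varray(void)
{
    for (int i = 0; i < 100; i++)
        varray[i] = i;   /* volatile stores: not auto-vectorized, which is fine here */
}
```

Inside the timed region, (void) force_load(&array[0]); then plays the same role as the cast expression shown earlier.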

Anyway, this compiles to:

    # gcc8.2 -O3
    lfence
    rdtsc
    lfence
    mov     rcx, rax
    sal     rdx, 32
    mov     eax, DWORD PTR [rsp]    # the load which wasn't there before.
    lfence
    or      rcx, rdx
    rdtsc
    mov     edi, OFFSET FLAT:.LC2
    lfence

Then you don't have to mess around with making sure tmp is used, or worry about dead-store elimination, CSE, or constant-propagation. In practice the _mm_mfence() or something else in Hadi's original answer included enough memory-barriering to make gcc actually redo the load for the cache-miss + cache-hit case, but it easily could have optimized away one of the reloads.

Note that this can result in asm that loads into a register but never reads it. Current CPUs do still wait for the result (especially if there's an lfence), but overwriting the result could let a hypothetical CPU discard the load and not wait for it. (It's up to the compiler whether it happens to do something else with the register before the next lfence, like mov part of the rdtsc result there.)

(This is tricky / unlikely for hardware to do, because the CPU has to be ready for exceptions; see the discussion in comments here.) RDRAND reportedly does work that way (see "What is the latency and throughput of the RDRAND instruction on Ivy Bridge?"), but that's probably a special case.

I tested this myself on Skylake by adding an xor eax,eax to the compiler's asm output, right after the mov eax, DWORD PTR [rsp], to kill the result of the cache-miss load. That didn't affect the timing.

Still, this is a potential gotcha with discarding the results of a volatile load; future CPUs might behave differently. It might be better to sum the load results (outside the timed region) and assign them at the end to a volatile int sink, in case future CPUs start discarding uops that produce unread results. But still use volatile for the loads to make sure they happen where you want them.
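
A minimal sketch of that pattern, assuming x86-64 with gcc or clang: each load is timed through a volatile access, its value is added to a running sum outside the timed region, and the sum is stored to a volatile sink at the end. The names time_load, miss_then_hit and sink are hypothetical, and the fence placement simply mirrors the question's code.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

static volatile int sink;   /* final destination for the summed load results */

/* Time one load of *p in TSC reference cycles. The loaded value is added
 * to *sum after the second rdtsc, so it is consumed outside the timed region. */
static uint64_t time_load(const int *p, int *sum)
{
    assert(p != NULL && sum != NULL);
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    int v = *(const volatile int *)p;   /* the load being timed */
    _mm_lfence();
    uint64_t t2 = __rdtsc();
    _mm_lfence();
    *sum += v;
    return t2 - t1;
}

/* Flush the line holding *p, then time a miss followed by a hit on it. */
static void miss_then_hit(const int *p, uint64_t *miss, uint64_t *hit)
{
    int sum = 0;
    _mm_mfence();
    _mm_clflush(p);
    _mm_mfence();
    *miss = time_load(p, &sum);   /* line was just flushed */
    *hit  = time_load(p, &sum);   /* line is cached again */
    sink = sum;                   /* one volatile store keeps both values live */
}
```

Each returned count still includes the fence and rdtsc overhead, which has to be measured and subtracted separately, as the question attempts to do.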

Also don't forget to do some kind of warm-up loop to get the CPU up to max speed, unless you want to measure the cache-miss execution time at idle clock speed. It looks like your empty timed region is taking a lot of reference cycles, so your CPU was probably clocked down pretty slow.
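
A warm-up can be as simple as spinning for a fixed wall-clock interval before the first measurement. The sketch below is one possible form; the function name warm_up is hypothetical, and how long is enough depends on the CPU and power governor, so any duration passed in is a guess to be tuned.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Busy-spin for roughly `ms` milliseconds so the core can leave its idle
 * frequency state before timing starts. Returns the number of increments. */
static uint64_t warm_up(unsigned ms)
{
    assert(ms > 0);
    struct timespec start, now;
    volatile uint64_t spins = 0;   /* volatile so the loop is not optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        for (int i = 0; i < 1000; i++)
            spins = spins + 1;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000LL
             + (now.tv_nsec - start.tv_nsec) < (long long)ms * 1000000LL);

    return spins;
}
```

Calling something like warm_up(100) at the top of main, before the clflush and the timed regions, is the intended use.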

So, how exactly do cache attacks, e.g. Meltdown and Spectre, overcome such an issue? Basically they have to disable the HW prefetcher, since they try to measure adjacent addresses in order to find whether they hit or miss.

The cache-read side-channel as part of a Meltdown or Spectre attack typically uses a stride large enough that HW prefetching can't detect the access pattern, e.g. on separate pages instead of contiguous lines. One of the first google hits for "meltdown cache read prefetch stride" was https://medium.com/@mattklein123/meltdown-spectre-explained-6bc8634cc0c2, which uses a stride of 4096. It could be tougher for Spectre, because your stride is at the mercy of the "gadgets" you can find in the target process.
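
To make the stride idea concrete, here is a rough flush-and-reload sketch over a probe buffer with one 4096-byte slot per candidate value. It is an illustration only, not an attack: the "secret" is touched directly by the same function. All names (make_probe, flush_reload, SLOTS) are hypothetical, and the shuffled visiting order is an extra precaution assumed here rather than something described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

#define STRIDE 4096   /* one page per slot, as in the text */
#define SLOTS  256    /* assumption: one slot per possible byte value */

/* Allocate the probe buffer and write to it so that every page is backed
 * by its own physical memory rather than a shared zero page. */
static uint8_t *make_probe(void)
{
    uint8_t *p = aligned_alloc(STRIDE, (size_t)SLOTS * STRIDE);
    if (p != NULL)
        memset(p, 1, (size_t)SLOTS * STRIDE);
    return p;
}

/* Time a single byte load in TSC reference cycles. */
static uint64_t time_read(const volatile uint8_t *p)
{
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    (void)*p;                     /* volatile read: the load being timed */
    _mm_lfence();
    uint64_t t2 = __rdtsc();
    _mm_lfence();
    return t2 - t1;
}

/* Flush every slot, touch slot `secret`, then return the slot that
 * reloads fastest. On a quiet machine this is usually `secret`. */
static int flush_reload(uint8_t *probe, int secret)
{
    assert(probe != NULL && secret >= 0 && secret < SLOTS);

    for (int i = 0; i < SLOTS; i++)
        _mm_clflush(&probe[(size_t)i * STRIDE]);
    _mm_mfence();

    (void)*(volatile uint8_t *)&probe[(size_t)secret * STRIDE];
    _mm_mfence();

    int best = -1;
    uint64_t best_t = UINT64_MAX;
    for (int i = 0; i < SLOTS; i++) {
        /* Odd multiplier mod 256: visits each slot once, out of order. */
        int slot = (i * 167 + 13) & (SLOTS - 1);
        uint64_t t = time_read(&probe[(size_t)slot * STRIDE]);
        if (t < best_t) {
            best_t = t;
            best = slot;
        }
    }
    return best;
}
```

Because the slots are a page apart, reloading one slot should not pull its neighbours into the cache, which is the property the adjacent array[0] / array[20] test in the question lacked.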
