Memory latency measurement with time stamp counter


Problem description


I have written the following code which first flushes two array elements and then tries to read elements in order to measure the hit/miss latencies.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <time.h>
int main()
{
    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the cache lines containing array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 30 ];   // read the first element => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d
diff1 is %lu
", tmp, diff1 );

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 70 ];      // read the second element => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff2 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d
diff2 is %lu
", tmp, diff2 );


    /* READ HIT*/
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    tmp = array[ 30 ];   // read the first element again => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff3 = t2 - t1;        // two fence statements are overhead
    printf( "tmp is %d
diff3 is %lu
", tmp, diff3 );


    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu
", ov );
    printf( "cache miss1 TSC is %lu
", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu
", diff2-ov );
    printf( "cache hit TSC is %lu
", diff3-ov );


    return 0;
}

The output is:

# gcc -O3 -o simple_flush simple_flush.c
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 529
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 497
cache miss2 (or hit due to prefetching) TSC is 190
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 486
tmp is 70
diff2 is 276
tmp is 30
diff3 is 46
lfence overhead is 32
cache miss1 TSC is 454
cache miss2 (or hit due to prefetching) TSC is 244
cache hit TSC is 14
# taskset -c 0 ./simple_flush
tmp is 30
diff1 is 848
tmp is 70
diff2 is 222
tmp is 30
diff3 is 46
lfence overhead is 34
cache miss1 TSC is 814
cache miss2 (or hit due to prefetching) TSC is 188
cache hit TSC is 12


There are some problems with the output for reading array[70]. The TSC indicates neither a hit nor a miss. I had flushed that item just as I did array[30]. One possibility is that when array[30] is accessed, the HW prefetcher brings in array[70]. So that should be a hit. However, the TSC is much larger than a hit. You can verify that the hit TSC is about 20 when I read array[30] for the second time.


Even if array[70] is not prefetched, the TSC should be similar to a cache miss.

Is there any explanation for this?

Update 1:


In order to perform the array read, I tried (void) *((int*)array+i) as suggested by Peter and Hadi.


In the output I see many negative results; that is, the measured overhead seems to be larger than the time taken by (void) *((int*)array+i).

Update 2:


I forgot to add volatile. The results are now meaningful.
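
For context on the two updates (my own illustration, not the asker's exact code): without volatile, gcc at -O3 is free to delete a load whose result is never used, so the timed region can shrink to almost nothing and end up shorter than the measured lfence/rdtsc overhead, which is what makes the subtracted results go negative. The volatile-qualified cast forces the access to be performed:

/* Without volatile, the unused load below may be optimized away at -O3,
   so the timed region measures only the fences and rdtsc themselves. */
(void) *((int*)array + i);

/* With volatile, the compiler must actually perform the load, so the
   timed region really contains the memory access. */
(void) *((volatile int*)array + i);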

Recommended answer


First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with printf, the reported values for diff3-ov range between 4 and 48 cycles (I've configured my system so that the TSC frequency is about equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we've got an L1D cache hit. If it is larger than 8, then most probably the preceding call to printf has kicked the target cache line out of the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain measured latencies higher than 8. @PeterCordes has suggested using (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most reported measurements for diff3-ov are exactly 8 cycles (which suggests that the measurement error is about 4 cycles), and the only other values that get reported are 0, 4, and 12. So Peter's approach is strongly recommended.
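
As an illustration of that recommendation (a sketch that assumes all printing is deferred until after the timed accesses; it is not the exact code used for the measurements above):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    int array[100];
    for (int i = 0; i < 100; i++)
        array[i] = i;                        // warm the array into the cache

    _mm_lfence();
    _mm_clflush( &array[30] );
    _mm_clflush( &array[70] );
    _mm_lfence();

    const int idx[3] = { 30, 70, 30 };       // expected: miss, miss(?), hit
    uint64_t d[3];
    for (int k = 0; k < 3; k++) {
        _mm_lfence();
        uint64_t a = __rdtsc();
        _mm_lfence();
        (void) *((volatile int *)array + idx[k]);   // load kept by the volatile cast
        _mm_lfence();
        uint64_t b = __rdtsc();
        _mm_lfence();
        d[k] = b - a;                        // no printf between measurements
    }

    for (int k = 0; k < 3; k++)
        printf("access to array[%d]: %lu cycles\n", idx[k], d[k]);
    return 0;
}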


In general, the main-memory access latency depends on many factors, including the state of the MMU caches and the impact of the page-table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30], and the IPs of the load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 is different from cache miss2. A thorough investigation is possible, but it would require a lot of effort, as you might imagine. Generally, if your core frequency is larger than 1.5 GHz (which is smaller than the TSC frequency on high-performance Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases, though, cache miss2 seems to be close to the L3 or L2 latency range, which would be due to prefetching.
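
To see where the two elements land for a particular run, one can print the cache-line and page indices of their (virtual) addresses; this small standalone check is my addition for illustration and is not part of the original answer:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int array[100];
    uintptr_t a30 = (uintptr_t)&array[30];   // byte offset 120 into the array
    uintptr_t a70 = (uintptr_t)&array[70];   // byte offset 280 into the array

    /* The two elements are 160 bytes apart: always different 64-byte cache
       lines, and they may or may not share a 4 KiB virtual page depending on
       where the stack buffer lands (the physical mapping can differ again). */
    printf("line of array[30]: %#lx, line of array[70]: %#lx\n",
           (unsigned long)(a30 >> 6), (unsigned long)(a70 >> 6));
    printf("page of array[30]: %#lx, page of array[70]: %#lx\n",
           (unsigned long)(a30 >> 12), (unsigned long)(a70 >> 12));
    return 0;
}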


I've determined that the following code gives a statistically more accurate measurement on Haswell:

t1 = __rdtscp(&dummy);
tmp = *((volatile int*)array + 30);
asm volatile ("add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
              "add $1, %1
	"
          : "+r" (tmp));          
t2 = __rdtscp(&dummy);
t2 = __rdtscp(&dummy);
loadlatency = t2 - t1 - 60; // 60 is the overhead


The probability that loadlatency is 4 cycles is 97%. The probability that loadlatency is 8 cycles is 1.7%. The probability that loadlatency takes other values is 1.3%. All of the other values are larger than 8 and are multiples of 4. I'll try to add an explanation later.
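
A sketch of how such a percentage breakdown could be collected (my addition with an assumed iteration count; the 11-add dependency chain from the snippet above is abbreviated to a single dummy asm, so it illustrates the tallying rather than reproducing the exact numbers):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define RUNS     100000
#define MAXLAT   256
#define OVERHEAD 60            /* the overhead constant used in the snippet above */

int main(void)
{
    static int array[100];
    unsigned int aux;
    unsigned long hist[MAXLAT] = { 0 };

    for (int i = 0; i < 100; i++)
        array[i] = i;                          /* keep the line cached */

    for (int r = 0; r < RUNS; r++) {
        int tmp;
        uint64_t t1 = __rdtscp(&aux);
        tmp = *((volatile int *)array + 30);   /* the timed (cached) load */
        asm volatile ("" : "+r"(tmp));         /* stand-in for the 11-add chain */
        uint64_t t2 = __rdtscp(&aux);
        uint64_t lat = (t2 - t1 > OVERHEAD) ? t2 - t1 - OVERHEAD : 0;
        if (lat < MAXLAT)
            hist[lat]++;                       /* tally the adjusted latency */
    }

    for (int i = 0; i < MAXLAT; i++)
        if (hist[i])
            printf("%3d cycles: %.2f%%\n", i, 100.0 * hist[i] / RUNS);
    return 0;
}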

