Variance in RDTSC overhead


Problem Description

I'm constructing a micro-benchmark to measure performance changes as I experiment with the use of SIMD instruction intrinsics in some primitive image processing operations. However, writing useful micro-benchmarks is difficult, so I'd like to first understand (and if possible eliminate) as many sources of variation and error as possible.

One factor that I have to account for is the overhead of the measurement code itself. I'm measuring with RDTSC, and I'm using the following code to find the measurement overhead:

#include <vector>

extern inline unsigned long long __attribute__((always_inline)) rdtsc64() {
    unsigned int hi, lo;
        // cpuid (with eax = 0) serializes the pipeline before rdtsc is read
        __asm__ __volatile__(
            "xorl %%eax, %%eax\n\t"
            "cpuid\n\t"
            "rdtsc"
        : "=a"(lo), "=d"(hi)
        : /* no inputs */
        : "rbx", "rcx");
    return ((unsigned long long)hi << 32ull) | (unsigned long long)lo;
}

unsigned int find_rdtsc_overhead() {
    const int trials = 1000000;

    std::vector<unsigned long long> times;
    times.resize(trials, 0);

    for (int i = 0; i < trials; ++i) {
        unsigned long long t_begin = rdtsc64();
        unsigned long long t_end = rdtsc64();
        times[i] = (t_end - t_begin);
    }

    // print frequencies of cycle counts
}
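
The elided reporting step could tally the samples into the table shown next. A minimal sketch (print_frequencies is an illustrative name, not the original reporting code):

#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

// Count how often each cycle value occurs and print the histogram in
// the "N cycles (counted M times)" form shown below.
void print_frequencies(const std::vector<unsigned long long>& times) {
    std::map<unsigned long long, unsigned long long> freq;
    for (std::size_t i = 0; i < times.size(); ++i)
        ++freq[times[i]];
    for (std::map<unsigned long long, unsigned long long>::const_iterator
             it = freq.begin(); it != freq.end(); ++it)
        std::cout << it->first << " cycles (counted " << it->second
                  << " times)" << std::endl;
}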

When running this code, I get output like this:

Frequency of occurrence (for 1000000 trials):
234 cycles (counted 28 times)
243 cycles (counted 875703 times)
252 cycles (counted 124194 times)
261 cycles (counted 37 times)
270 cycles (counted 2 times)
693 cycles (counted 1 times)
1611 cycles (counted 1 times)
1665 cycles (counted 1 times)
... (a bunch of larger times each only seen once)

My questions are these:

  1. What are the possible causes of the bi-modal distribution of cycle counts generated by the code above?
  2. Why does the fastest time (234 cycles) only occur a handful of times—what highly unusual circumstance could reduce the count?


Further Information

Platform:

  • Linux 2.6.32 (Ubuntu 10.04)
  • g++ 4.4.3
  • Core 2 Duo (E6600); this has a constant-rate TSC.

SpeedStep has been turned off (processor is set to performance mode and is running at 2.4GHz); if running in 'ondemand' mode, I get two peaks at 243 and 252 cycles, and two (presumably corresponding) peaks at 360 and 369 cycles.
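
If it helps, the active governor can be confirmed from sysfs as well as through the cpufreq tools. A minimal sketch, assuming the usual cpufreq path exists on the system (the path is an assumption about the platform, not taken from the original test code):

#include <fstream>
#include <iostream>
#include <string>

// Print the active cpufreq governor for core 0, e.g. "performance"
// or "ondemand". Assumes the standard sysfs layout is present.
int main() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    std::string governor;
    f >> governor;
    std::cout << governor << std::endl;
    return 0;
}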

I'm using sched_setaffinity to lock the process to one core. If I run the test on each core in turn (i.e., lock to core 0 and run, then lock to core 1 and run), I get similar results for the two cores, except that the fastest time of 234 cycles tends to occur slightly fewer times on core 1 than on core 0.
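
For reference, the pinning itself is only a few lines. A minimal sketch of the sched_setaffinity call described above (pin_to_core is an illustrative helper name and error handling is omitted):

#include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (GNU extensions;
                     // g++ typically defines _GNU_SOURCE already, a C compiler
                     // would need it defined before this include)

// Pin the calling process to one core so every rdtsc reading comes from
// the same TSC. Returns 0 on success, -1 on failure (errno is set).
int pin_to_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0 /* calling process */, sizeof(mask), &mask);
}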

Compile command is:

g++ -Wall -mssse3 -mtune=core2 -O3 -o test.bin test.cpp

The code that GCC generates for the core loop is:

.L105:
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    movl    %edx, %ebp
    movl    %eax, %edi
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    salq    $32, %rdx
    salq    $32, %rbp
    mov %eax, %eax
    mov %edi, %edi
    orq %rax, %rdx
    orq %rdi, %rbp
    subq    %rbp, %rdx
    movq    %rdx, (%r8,%rsi)
    addq    $8, %rsi
    cmpq    $8000000, %rsi
    jne .L105

Solution

RDTSC can return inconsistent results for a number of reasons:

  • On some CPUs (especially certain older Opterons), the TSC isn't synchronized between cores. It sounds like you're already handling this by using sched_setaffinity -- good!
  • If the OS timer interrupt fires while your code is running, there'll be a delay introduced while it runs. There's no practical way to avoid this; just throw out unusually high values (a minimal filtering sketch follows this list).
  • Pipelining artifacts in the CPU can sometimes throw you off by a few cycles in either direction in tight loops. It's perfectly possible to have some loops that run in a non-integer number of clock cycles.
  • Cache! Depending on the vagaries of the CPU cache, memory operations (like the write to times[]) can vary in speed. In this case, you're fortunate that the std::vector implementation being used is just a flat array; even so, that write can throw things off. This is probably the most significant factor for this code.
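
As a rough illustration of the "throw out unusually high values" point above, a minimal sketch (discard_outliers is an illustrative helper, and the 100-cycle margin above the observed minimum is an arbitrary choice, not a recommendation):

#include <algorithm>
#include <cstddef>
#include <vector>

// Keep only samples close to the fastest observation; readings far above
// it were almost certainly disturbed by an interrupt or other outside event.
std::vector<unsigned long long>
discard_outliers(const std::vector<unsigned long long>& times) {
    unsigned long long fastest =
        *std::min_element(times.begin(), times.end());
    std::vector<unsigned long long> kept;
    for (std::size_t i = 0; i < times.size(); ++i) {
        if (times[i] <= fastest + 100)
            kept.push_back(times[i]);
    }
    return kept;
}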

I'm not enough of a guru on the Core2 microarchitecture to say exactly why you're getting this bimodal distribution, or how your code ran faster those 28 times, but it probably has something to do with one of the reasons above.
