rdtsc timing for measuring a function



    I want to time a function call with rdtsc. So I measured it in two ways as follows.

    1. Call it in a loop. Aggregate each rdtsc difference within the loop and divide by the number of calls. (Let's say this is N.)
    2. Call it in a loop. Get the rdtsc difference of the loop itself and divide by N.

    But I see a couple of inconsistent behaviors.

    1. When I increase N, the times decrease rather monotonically in both methods 1 and 2. For method 2 this is understandable, in that it amortizes the loop control overhead. But I am not sure why it is so for method 1.
    2. Actually, for method 2, each time I increased N, the value I got for N=1 seemed to simply be divided by the new N. Inspecting the gdb disassembly made me realize this was a compiler optimization at -O2, where the loop is skipped in the second case. So I retried with -O0; the gdb disassembly shows the actual loop being there in the second case as well.


    Code is given below.

        #include <stdio.h>
        #include <inttypes.h>
        #include <stdlib.h>
    
        typedef unsigned long long ticks;
    
        static __inline__ ticks getticks(void) {
          unsigned a, d; 
          asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
          return ((ticks)a) | (((ticks)d) << 32); 
        }
    
        __attribute__ ((noinline))
        void bar() {
    
        }
    
        int main(int argc, char** argv) {
    
           long long N = 1000000; 
           N = atoi(argv[1]);
           int i;
           long long bar_total = 0;
    
           ticks start = 0, end = 0;
    
           for (i = 0; i < N; i++) {
             start = getticks();
             bar();
             end = getticks();
             bar_total += (end - start);
           } 
    
           fprintf(stdout, "Total invocations : %lld\n", N);
           fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total/  N));
    
          start = getticks();
          for (i = 0; i < N; i++) {
            bar();
          } 
          end = getticks();
    
          bar_total = (end - start);
    
          fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total/ N));
    
          return 0;
    
         }
    

    Any idea what's going on here? I can post the gdb disassembly if needed as well. I used the rdtsc implementation from http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h

    Edit: I am going to have to retract my second statement, that at -O0 the time drops directly proportionally to N in the second case. I guess it was some mistake I made during the build that caused an older version to persist. Anyhow, it still goes down somewhat, along with the figures for method 1. Here are some numbers for different N values.

    taskset -c 2 ./example.exe 1
    Total invocations : 1
    [regular] bar overhead : 108.000000
    [Loop] bar overhead : 138.000000
    
    taskset -c 2 ./example.exe 10
    Total invocations : 10
    [regular] bar overhead : 52.900000
    [Loop] bar overhead : 40.700000
    
    taskset -c 2 ./example.exe 100
    Total invocations : 100
    [regular] bar overhead : 46.780000
    [Loop] bar overhead : 15.570000
    
    taskset -c 2 ./example.exe 1000
    Total invocations : 1000
    [regular] bar overhead : 46.069000
    [Loop] bar overhead : 13.669000
    
    taskset -c 2 ./example.exe 100000
    Total invocations : 10000
    [regular] bar overhead : 46.010100
    [Loop] bar overhead : 13.444900
    
    taskset -c 2 ./example.exe 100000000
    Total invocations : 100000000
    [regular] bar overhead : 26.970272
    [Loop] bar overhead : 5.201252
    
    taskset -c 2 ./example.exe 1000000000
    Total invocations : 1000000000
    [regular] bar overhead : 18.853279
    [Loop] bar overhead : 5.218234
    
    taskset -c 2 ./example.exe 10000000000
    Total invocations : 1410065408
    [regular] bar overhead : 18.540719
    [Loop] bar overhead : 5.216395
    

    I see two new behaviors now.

    1. Method 1 converges more slowly than method 2. But I am still puzzled why there is such a drastic difference in values across different N settings. Perhaps I am making some basic mistake here which I don't see at the moment.
    2. The method 1 value is actually larger than the method 2 value by some margin. I expected it to be on par with or slightly smaller than the method 2 value, since it doesn't contain the loop control overhead.

    Questions

    So in summary my questions are

    1. Why do the values given by both methods change so drastically when increasing N? Especially for method 1, which doesn't account for the loop control overhead.

    2. Why is the second method's result less than the first method's, when the first method excludes the loop control overhead from its calculations?

    Edit 2

    Regarding the suggested rdtscp solution.

    Being unfamiliar with inline assembly, I did the following.

    static __inline__ ticks getstart(void) {
      unsigned cycles_high = 0, cycles_low = 0; 
      asm volatile ("CPUID\n\t"
                 "RDTSC\n\t"
                 "mov %%edx, %0\n\t"
                 "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
                 "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
    }
    
    static __inline__ ticks getend(void) {
      unsigned cycles_high = 0, cycles_low = 0; 
      asm volatile("RDTSCP\n\t"
             "mov %%edx, %0\n\t"
              "mov %%eax, %1\n\t"
               "CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low)::
               "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
    }
    

    and used the above methods before and after the function call. But now I get nonsensical results like the following.

    Total invocations : 1000000
    [regular] bar overhead : 304743228324.708374
    [Loop] bar overhead : 33145641307.734016
    

    What's the catch? I wanted to factor these out as inlined methods since I use them in multiple places.

    A. Solution in the comments.

    Solution

    You use the plain rdtsc instruction, which may not work correctly on out-of-order CPUs, like Xeons and Cores. You should add some serializing instruction or switch to the rdtscp instruction:

    http://en.wikipedia.org/wiki/Time_Stamp_Counter

    Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

    Intel has a recent manual on using rdtsc/rdtscp - How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for the start and rdtscp for the end timer:

    The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the values of edx and eax). The implementation is as follows:

    asm volatile ("CPUID\n\t"
     "RDTSC\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
    "%rax", "%rbx", "%rcx", "%rdx");
    /***********************************/
    /*call the function to measure here*/
    /***********************************/
    asm volatile("RDTSCP\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t"
     "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
    "%rax", "%rbx", "%rcx", "%rdx");
    
    start = ( ((uint64_t)cycles_high << 32) | cycles_low );
    end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );
    

    In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as "inline" so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.

    Your example is not quite correct; you try to measure the empty function bar(), but it is so short that you are mostly measuring the rdtsc overhead in method 1 (for() { rdtsc; bar(); rdtsc; }). According to Agner Fog's table for Haswell - http://www.agner.org/optimize/instruction_tables.pdf page 191 (the long table "Intel Haswell List of instruction timings and μop breakdown", at its very end) - RDTSC has 15 uops (no fusion possible) and a latency of 24 ticks; RDTSCP, on the older Sandy Bridge microarchitecture, has 23 uops and a 36-tick latency, versus 21 uops and 28 ticks for rdtsc. So you can't use plain rdtsc (or rdtscp) to directly measure such short code.

