Why is clock_gettime so erratic?


Problem Description

  • Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).

Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).

I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).

The code used for testing is at the bottom of this question (see section Appendix).

Apologies in advance for the length of this post.

In short: I am using clock_gettime to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).

Question: Is there a reason why clock_gettime would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?

Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds of measurements.

I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable). I also need a high resolution (which is why times is not suitable).

I have tried the following methods:

  • rdtsc (on Linux and Windows),
  • clock_gettime (with CLOCK_THREAD_CPUTIME_ID, on Linux), and
  • QueryThreadCycleTime (on Windows).

Methodology: The analysis was performed over 25 runs. In each run, each code segment repeats 101 times, so I have 2525 measurements. I then look at a histogram of the measurements and calculate some basic statistics (mean, std. dev., median, mode, min, and max).

I do not show how I measured the 'similarity' of the three methods; it simply involved a basic comparison of the proportion of time spent in each code segment ('proportion' means that the times are normalised). I then looked at the pure differences in these proportions. This comparison showed that 'rdtsc', 'QTCT', and 'CGT' all measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation, which makes it unusable for my use case.

Results:

A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):

  • clock_gettime:

  • 1881 measurements of 11 ns,
  • 595 measurements were (distributed almost normally) between 3369 and 3414 ns,
  • 2 measurements of 11680 ns,
  • 1 measurement of 1506022 ns, and
  • the rest are between 900 and 5000 ns.

Min: 11 ns

  • rdtsc (note: no context switches occurred during this run, but when one does happen, it usually shows up as a single measurement of around 30000 ticks):

  • 1178 measurements between 274 and 325 ticks,
  • 306 measurements between 326 and 375 ticks,
  • 910 measurements between 376 and 425 ticks,
  • 129 measurements between 426 and 990 ticks,
  • 1 measurement of 1240 ticks, and
  • 1 measurement of 1256 ticks.

Min: 274 ticks

Discussion:

  • rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).

  • QueryThreadCycleTime gives very consistent measurements, i.e. a much lower standard deviation than rdtsc. When no context switches happen, this method is almost identical to rdtsc.

  • clock_gettime, on the other hand, produces extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme compared with rdtsc.

I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?

I have investigated this a bit further. I have done two things:

  1. Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in Appendix), and

  2. Called clock_gettime in a plain loop and stored the readings into an array (see code 2 in the Appendix). I then measured the delta times (the differences between successive readings, which should roughly correspond to the overhead of a single clock_gettime call).

I measured this on two different computers with two different Linux kernel versions:

CGT:

  1. CPU: Core 2 Duo L9400 @ 1.86GHz

Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

Results:

  • Estimated clock_gettime overhead: between 690 and 710 ns
  • Delta times:

  • Average: 815.22 ns
  • Median: 713 ns
  • Mode: 709 ns
  • Min: 698 ns
  • Max: 23359 ns
  • Histogram (left-out ranges have frequencies of 0):

      Range       |  Frequency
------------------+-----------
  697 < x ≤ 800   ->     78111  <-- cached?
  800 < x ≤ 1000  ->     16412
 1000 < x ≤ 1500  ->         3
 1500 < x ≤ 2000  ->      4836  <-- uncached?
 2000 < x ≤ 3000  ->       305
 3000 < x ≤ 5000  ->       161
 5000 < x ≤ 10000 ->       105
10000 < x ≤ 15000 ->        53
15000 < x ≤ 20000 ->         8
20000 < x         ->         5

  2. CPU: 4 × Dual Core AMD Opteron Processor 275

Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Results:

  • Estimated clock_gettime overhead: between 279 and 283 ns
  • Delta times:

  • Average: 320.00 ns
  • Median: 1 ns
  • Mode: 1 ns
  • Min: 1 ns
  • Max: 3495529 ns
  • Histogram (left-out ranges have frequencies of 0):

      Range         |  Frequency
--------------------+-----------
          x ≤ 1     ->     86738  <-- cached?
    282 < x ≤ 300   ->     13118  <-- uncached?
    300 < x ≤ 440   ->        78
   2000 < x ≤ 5000  ->        52
   5000 < x ≤ 30000 ->         5
3000000 < x         ->         8

RDTSC:

Relevant code: rdtsc_delta.c and rdtsc_overhead.c.

  1. CPU: Core 2 Duo L9400 @ 1.86GHz

Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

Results:

  • Estimated overhead: between 39-42 ticks
  • Delta times:

  • Average: 52.46 ticks
  • Median: 42 ticks
  • Mode: 42 ticks
  • Min: 35 ticks
  • Max: 28700 ticks
  • Histogram (left-out ranges have frequencies of 0):

      Range       |  Frequency
------------------+-----------
   34 < x ≤ 35    ->     16240  <-- cached?
   41 < x ≤ 42    ->     63585  <-- uncached? (small difference)
   48 < x ≤ 49    ->     19779  <-- uncached?
   49 < x ≤ 120   ->       195
 3125 < x ≤ 5000  ->       144
 5000 < x ≤ 10000 ->        45
10000 < x ≤ 20000 ->         9
20000 < x         ->         2

  2. CPU: 4 × Dual Core AMD Opteron Processor 275

Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Results:

  • Estimated overhead: between 13.7-17.0 ticks
  • Delta times:

  • Average: 35.44 ticks
  • Median: 16 ticks
  • Mode: 16 ticks
  • Min: 14 ticks
  • Max: 16372 ticks
  • Histogram (left-out ranges have frequencies of 0):

      Range       |  Frequency
------------------+-----------
   13 < x ≤ 14    ->       192
   14 < x ≤ 21    ->     78172  <-- cached?
   21 < x ≤ 50    ->     10818
   50 < x ≤ 103   ->     10624  <-- uncached?
 5825 < x ≤ 6500  ->        88
 6500 < x ≤ 8000  ->        88
 8000 < x ≤ 10000 ->        11
10000 < x ≤ 15000 ->         4
15000 < x ≤ 16372 ->         2

QTCT:

Relevant code: qtct_delta.c and qtct_overhead.c.

  1. CPU: Core 2 6700 @ 2.66GHz

Kernel: Windows 7 64-bit

Results:

  • Estimated overhead: between 890-940 ticks
  • Delta times:

  • Average: 1057.30 ticks
  • Median: 890 ticks
  • Mode: 890 ticks
  • Min: 880 ticks
  • Max: 29400 ticks
  • Histogram (left-out ranges have frequencies of 0):

      Range       |  Frequency
------------------+-----------
  879 < x ≤ 890   ->     71347  <-- cached?
  895 < x ≤ 1469  ->       844
 1469 < x ≤ 1600  ->     27613  <-- uncached?
 1600 < x ≤ 2000  ->        55
 2000 < x ≤ 4000  ->        86
 4000 < x ≤ 8000  ->        43
 8000 < x ≤ 16000 ->        10
16000 < x         ->         1



Conclusion

I believe the answer to my question is a buggy implementation on one of my machines (the AMD machine with an old Linux kernel).

The CGT results from the AMD machine with the old kernel show some extreme readings. If we look at the delta times, the most frequent delta is 1 ns, which would mean a call to clock_gettime took less than a nanosecond! It also produced a number of extraordinarily large deltas (more than 3000000 ns). This looks like erroneous behaviour. (Unaccounted-for core migrations, perhaps?)

Remarks:

  • The overhead of CGT and QTCT is quite big.

  • It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.

  • Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...

Code 1: clock_gettime_overhead.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi() */

/* Compiled & executed with:

    gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
    ./clock_gettime_overhead 100000
*/

int main(int argc, char **args) {
    struct timespec tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
    printf("Estimated overhead: %lld ns\n",
            ((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
                    - ((int64_t) tstart.tv_sec * 1000000000
                            + (int64_t) tstart.tv_nsec)) / N / 10);
    return 0;
}

Code 2: clock_gettime_delta.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
    ./clock_gettime_delta > results
*/

#define N 100000

int main(int argc, char **args) {
    struct timespec sample, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
        results[n] = sample;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%lld\t%lld\n",
               (int64_t) results[n].tv_sec * 1000000000 + 
                   (int64_t)results[n].tv_nsec,
               (int64_t) results[n].tv_sec * 1000000000 + 
                   (int64_t) results[n].tv_nsec - 
                   ((int64_t) results[n-1].tv_sec * 1000000000 + 
                        (int64_t)results[n-1].tv_nsec));
    }
    return 0;
}

Code 3: rdtsc.h

#include <stdint.h> /* uint64_t, uint32_t */

static uint64_t rdtsc() {
#if defined(__GNUC__)
#   if defined(__i386__)
    uint64_t x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
#   elif defined(__x86_64__)
    uint32_t hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | ((uint64_t)hi << 32);
#   else
#       error Unsupported architecture.
#   endif
#elif defined(_MSC_VER)
    return __rdtsc();
#else
#   error Other compilers not supported...
#endif
}

Code 4: rdtsc_delta.c

#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_delta.c -O0 -o rdtsc_delta
    ./rdtsc_delta > rdtsc_delta_results

Windows:

    cl -Od rdtsc_delta.c
    rdtsc_delta.exe > windows_rdtsc_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t results[N];
    int n;
    for (n = 0; n < N; ++n) {
        results[n] = rdtsc();
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));
    }
    return 0;
}

Code 5: rdtsc_overhead.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi() */
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
    ./rdtsc_overhead 1000000 > rdtsc_overhead_results

Windows:

    cl -Od rdtsc_overhead.c
    rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    tstart = rdtsc();
    for (n = 0; n < N; ++n) {
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
    }
    tend = rdtsc();
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

Code 6: qtct_delta.c

#include <stdio.h>
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_delta.c
    qtct_delta.exe > windows_qtct_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t ticks, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        results[n] = ticks;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));
    }
    return 0;
}

Code 7: qtct_overhead.c

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi() */
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_overhead.c
    qtct_overhead.exe 1000000
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, ticks;
    int n, N;
    N = atoi(args[1]);
    QueryThreadCycleTime(GetCurrentThread(), &tstart);
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
    }
    QueryThreadCycleTime(GetCurrentThread(), &tend);
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

Answer

Well, as CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, it will likely suffer from the same problems. The manual page for clock_gettime says:

The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.

Which sounds like it might explain your problems? Maybe you should lock your process to one CPU to get stable results?
