Comparing the time measured results produced by rdtsc clock and C++11 std::chrono::high_resolution_clock


Problem Description


I am trying to compare the times measured by the C++11 std::chrono::high_resolution_clock and the rdtsc_clock below. From high_resolution_clock I am getting results like 11000, 3000, 1000, 0. From rdtsc_clock I am getting 134, 15, 91, etc. Why do their results look so different? My gut feeling is that rdtsc_clock is giving the ~accurate results; am I right?

    #include <chrono>
    #include <cstdint>
    #include <ratio>

    template<std::intmax_t clock_freq>
    struct rdtsc_clock {
        typedef unsigned long long rep;
        typedef std::ratio<1, clock_freq> period;
        typedef std::chrono::duration<rep, period> duration;
        typedef std::chrono::time_point<rdtsc_clock> time_point;
        static const bool is_steady = true;

        static time_point now() noexcept
        {
            // RDTSC returns the 64-bit time-stamp counter split across EDX:EAX.
            unsigned lo, hi;
            asm volatile("rdtsc" : "=a" (lo), "=d" (hi));

            return time_point(duration(static_cast<rep>(hi) << 32 | lo));
        }
    };

The timing code:

typedef std::chrono::high_resolution_clock Clock;
//typedef rdtsc_clock<3300000000> Clock;
typedef std::chrono::nanoseconds nanoseconds;
typedef std::chrono::duration<double, typename Clock::period> Cycle;
for(int n = 0; n < 10; n++){
   auto t1 = Clock::now();
   // code under test
   auto t2 = Clock::now();
   printf("%.0f ns \n",
          static_cast<double>(std::chrono::duration_cast<nanoseconds>(Cycle(t2 - t1)).count()));
}

Solution

Issues with RDTSC usage

If you read some online docs on RDTSC, you'll see that it doesn't ensure that instructions appearing after the RDTSC instruction aren't executed in the pipeline before RDTSC itself runs (nor that earlier instructions don't run afterwards). The normal advice is to use a CPUID instruction immediately before and/or after the RDTSC to trigger such "sequence points". Obviously this impacts program performance, and is more desirable for some kinds of measurements than others (where average throughput figures are of more interest than individual samples). You can expect the Standard library implementation to be much more careful about this, which may be one reason its measurements are far higher.
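
For illustration, here is a minimal sketch of such a serialized read, assuming x86-64 with GCC/Clang inline assembly; the helper name rdtsc_serialized is made up for this example:

    #include <cstdint>

    // Serialized TSC read: CPUID waits for all earlier instructions to retire
    // before RDTSC samples the time-stamp counter (hypothetical helper,
    // x86-64 GCC/Clang inline asm assumed).
    static inline std::uint64_t rdtsc_serialized()
    {
        unsigned lo, hi;
        asm volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi)
                     : "a" (0)                 // CPUID leaf 0; any leaf serializes
                     : "%rbx", "%rcx");        // CPUID also clobbers these
        return (static_cast<std::uint64_t>(hi) << 32) | lo;
    }

A common pattern is to use a read like this for the start timestamp and RDTSCP followed by another serializing instruction for the stop timestamp, so the measured code can neither leak before the first sample nor past the second.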

Cross-Core Issues (not relevant per your comment)

Each CPU core maintains its own TSC register... if you just start taking samples on a thread that's not bound to a core, or on multiple threads not bound to the same core, you may see "weird" jumps in values. Some companies (e.g. Microsoft) insist that the Hardware Abstraction Layer (HAL) is responsible for trying to get the registers as close to in-sync as possible, but many (even brand new high end) PCs simply fail to do this.

You can get around this by binding to a core, or by doing some calibration step that measures the cross-core deltas (with some calibration error margin), then adjusting later samples based on the core from which they're sampled (which itself is painful to determine on most CPUs - you'll need to spin taking samples between CPUID instructions or something similar).
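
As an illustration of the "binding to a core" option, a minimal sketch assuming Linux with glibc (pin_to_core is a hypothetical helper) that pins the calling thread to one core so every TSC sample comes from the same core's counter:

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to a single core (hypothetical helper,
    // Linux/glibc assumed; may require _GNU_SOURCE). Returns true on success.
    bool pin_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

Calling something like pin_to_core(0) before the timing loop keeps every rdtsc_clock::now() sample on the same core's TSC.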

