Using the Time Stamp Counter to get a timestamp


Question

I have used the code below to get the clock cycle count of the processor:

unsigned long long rdtsc(void)
{
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

I get some value, say 43, but what is the unit here? Is it in microseconds or nanoseconds?

I used the command below to get the frequency of my board:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
1700000

I also used the command below to find my processor speed:

dmidecode -t processor | grep "Speed"
Max Speed: 3700 MHz
Current Speed: 3700 MHz

Now, how do I use the above frequency to convert the value to microseconds or milliseconds?

Recommended answer

A simple answer to the stated question, "how do I convert the TSC frequency to microseconds or milliseconds?" is: you do not. The value returned by RDTSC is a count of TSC (Time Stamp Counter) ticks, not a time. What the TSC clock frequency actually is varies depending on the hardware, and on some hardware it may vary at run time. To measure real time in Linux, use clock_gettime(CLOCK_REALTIME) or clock_gettime(CLOCK_MONOTONIC).

As Peter Cordes mentioned in a comment (Aug 2018), on most current x86-64 architectures the Time Stamp Counter (accessed by the RDTSC instruction, or by the __rdtsc() function declared in <x86intrin.h>) counts reference clock cycles, not CPU clock cycles. His answer to a similar question about C++ is also valid for C in Linux on x86-64, because the compiler provides the underlying built-in when compiling either C or C++, and the rest of that answer deals with the hardware details. I recommend reading that one, too.
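If a rough conversion is still wanted, one possible approach is to calibrate the TSC against CLOCK_MONOTONIC. The sketch below is only an illustration, and it assumes the TSC ticks at a constant rate on the machine in question (which is not guaranteed on all hardware). The helper names estimate_tsc_hz() and tsc_to_ns() are hypothetical and do not come from the original answer.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <x86intrin.h>

/* Hypothetical helper: estimate TSC ticks per second by comparing the
   TSC against CLOCK_MONOTONIC across a short sleep of sleep_ns
   nanoseconds (must be between 1 and 999999999).
   Assumes the TSC rate is constant. Returns 0.0 on failure. */
double estimate_tsc_hz(long sleep_ns)
{
    struct timespec     t0, t1, pause;
    unsigned long long  c0, c1;
    double              seconds;

    if (sleep_ns <= 0 || sleep_ns >= 1000000000L)
        return 0.0;

    pause.tv_sec = 0;
    pause.tv_nsec = sleep_ns;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return 0.0;
    c0 = __rdtsc();

    /* The sleep may be cut short by a signal; that only shortens
       the calibration interval, which is measured anyway. */
    nanosleep(&pause, NULL);

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return 0.0;
    c1 = __rdtsc();

    seconds = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1000000000.0;
    if (seconds <= 0.0)
        return 0.0;

    return (double)(c1 - c0) / seconds;
}

/* Convert a TSC tick difference to nanoseconds, given an estimated rate. */
double tsc_to_ns(unsigned long long ticks, double tsc_hz)
{
    return (double)ticks * 1000000000.0 / tsc_hz;
}
```

For example, estimate_tsc_hz(10000000L) calibrates over roughly 10 ms. The estimate is only as good as the constant-rate assumption, so for anything that needs real time, clock_gettime() remains the safer choice.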

The rest of this answer assumes the underlying issue is microbenchmarking code, to find out how two implementations of some function compare to each other.

On x86 (Intel 32-bit) and x86-64 (AMD64, Intel and AMD 64-bit) architectures, you can use __rdtsc() from <x86intrin.h> to find out the number of TSC clock cycles elapsed. This can be used to measure and compare the number of cycles used by different implementations of some function, typically by calling each implementation a large number of times.

Do note that there are hardware differences in how the TSC clock is related to the CPU clock. The more recent answer mentioned above goes into some detail on that. For practical purposes in Linux, it is sufficient to use cpufreq-set to disable frequency scaling (to ensure the relationship between the CPU and TSC frequencies does not change during microbenchmarking), and optionally taskset to restrict the microbenchmark to specific CPU core(s). That ensures the results gathered in the microbenchmark can be compared to each other.

(As Peter Cordes commented, we also want to add _mm_lfence() from <emmintrin.h> (included by <immintrin.h>). This ensures that the CPU does not internally reorder the RDTSC operation with respect to the function being benchmarked. You can use -DNO_LFENCE at compile time to omit those, if you want.)

Let's say you have functions void foo(void); and void bar(void); that you wish to compare:

#include <stdlib.h>
#include <x86intrin.h>
#include <stdio.h>

#ifdef    NO_LFENCE
#define   lfence()
#else
#include <emmintrin.h>
#define   lfence()  _mm_lfence()
#endif

/* The functions to compare; assumed to be defined elsewhere. */
void foo(void);
void bar(void);

static int cmp_ull(const void *aptr, const void *bptr)
{
    const unsigned long long  a = *(const unsigned long long *)aptr;
    const unsigned long long  b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

unsigned long long *measure_cycles(size_t count, void (*func)(void))
{
    unsigned long long  *elapsed, started, finished;
    size_t               i;

    elapsed = malloc((count + 2) * sizeof elapsed[0]);
    if (!elapsed)
        return NULL;

    /* Call func() count times, measuring the TSC cycles for each call. */
    for (i = 0; i < count; i++) {
        /* First, let's ensure our CPU executes everything thus far. */
        lfence();
        /* Start timing. */
        started = __rdtsc();
        /* Ensure timing starts before we call the function. */
        lfence();
        /* Call the function. */
        func();
        /* Ensure everything has been executed thus far. */
        lfence();
        /* Stop timing. */
        finished = __rdtsc();
        /* Ensure we have the counter value before proceeding. */
        lfence();

        elapsed[i] = finished - started;
    }

    /* The very first call is likely the cold-cache case,
       so in case that measurement might contain useful
       information, we put it at the end of the array.
       We also terminate the array with a zero. */
    elapsed[count] = elapsed[0];
    elapsed[count + 1] = 0;

    /* Sort the cycle counts. */
    qsort(elapsed, count, sizeof elapsed[0], cmp_ull);

    /* This function returns all cycle counts, in sorted order,
       although the median, elapsed[count/2], is the one
       I personally use. */
    return elapsed;
}

void benchmark(const size_t count)
{
    unsigned long long  *foo_cycles, *bar_cycles;

    if (count < 1)
        return;

    printf("Measuring run time in Time Stamp Counter cycles:\n");
    fflush(stdout);

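    foo_cycles = measure_cycles(count, foo);
    bar_cycles = measure_cycles(count, bar);
    if (!foo_cycles || !bar_cycles) {
        /* measure_cycles() returns NULL if it runs out of memory. */
        fprintf(stderr, "Out of memory.\n");
        free(bar_cycles);
        free(foo_cycles);
        return;
    }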

    printf("foo(): %llu cycles (median of %zu calls)\n", foo_cycles[count/2], count);
    printf("bar(): %llu cycles (median of %zu calls)\n", bar_cycles[count/2], count);

    free(bar_cycles);
    free(foo_cycles);
}

Note that the above results are very specific to the compiler and compiler options used, and of course to the hardware the code is run on. The median number of cycles can be interpreted as "the typical number of TSC cycles taken", because the measurement is not completely reliable (it may be affected by events outside the process; for example, by context switches, or by migration to another core on some CPUs). For the same reason, I don't trust the minimum, maximum, or average values.
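To illustrate why the median is preferred, here is a small sketch comparing the median and the mean of a set of cycle counts. The helper names median_cycles() and mean_cycles() are hypothetical, and the sample values in the usage note are made up for illustration; they are not measurements.

```c
#include <stdlib.h>

static int cmp_cycles(const void *aptr, const void *bptr)
{
    const unsigned long long  a = *(const unsigned long long *)aptr;
    const unsigned long long  b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

/* Sorts samples[] in place and returns the median
   (the upper middle element when count is even).
   count must be at least 1. */
unsigned long long median_cycles(unsigned long long *samples, size_t count)
{
    qsort(samples, count, sizeof samples[0], cmp_cycles);
    return samples[count / 2];
}

/* Returns the arithmetic mean, rounded down. count must be at least 1. */
unsigned long long mean_cycles(const unsigned long long *samples, size_t count)
{
    unsigned long long  sum = 0;
    size_t              i;

    for (i = 0; i < count; i++)
        sum += samples[i];

    return sum / count;
}
```

With the made-up samples { 43, 44, 43, 45, 9000, 44, 43 }, where 9000 stands in for a call interrupted by a context switch, median_cycles() returns 44 while mean_cycles() returns 1323. A single disturbed measurement dominates the mean but leaves the median essentially untouched.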

However, the cycle counts of the two implementations above (foo() and bar()) can be compared to find out how they perform relative to each other in a microbenchmark. Just remember that microbenchmark results may not extend to real work tasks, because the interactions between tasks' resource use are so complex. One function might be superior in all microbenchmarks, but poorer than others in the real world, because, for example, it is only efficient when it has lots of CPU cache to use.

 

In Linux in general, you can use the CLOCK_REALTIME clock to measure the real time (wall clock time) used, in the very same manner as above. CLOCK_MONOTONIC is even better, because it is not affected by direct changes the administrator might make to the realtime clock (say, if they noticed the system clock is ahead or behind); only drift adjustments due to NTP etc. are applied. Daylight saving time, or changes thereof, does not affect measurements made with either clock. Again, the median of a number of measurements is the result I seek, because events outside the measured code itself can affect the result.

For example:

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#ifdef   NO_LFENCE
#define  lfence()
#else
#include <emmintrin.h>
#define  lfence() _mm_lfence()
#endif

/* The functions to compare; assumed to be defined elsewhere. */
void foo(void);
void bar(void);

static int cmp_double(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

double median_seconds(const size_t count, void (*func)(void))
{
    struct timespec started, stopped;
    double         *seconds, median;
    size_t          i;

    seconds = malloc(count * sizeof seconds[0]);
    if (!seconds)
        return -1.0;

    for (i = 0; i < count; i++) {
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &started);
        lfence();
        func();
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &stopped);
        lfence();
        seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
                   + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
    }

    qsort(seconds, count, sizeof seconds[0], cmp_double);
    median = seconds[count / 2];
    free(seconds);
    return median;
}

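static double realtime_precision(void)
{
    struct timespec t;

    /* Report the resolution of the clock actually used for the
       measurements above, which is CLOCK_MONOTONIC. */
    if (clock_getres(CLOCK_MONOTONIC, &t) == 0)
        return (double)t.tv_sec
             + (double)t.tv_nsec / 1000000000.0;

    return 0.0;
}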

void benchmark(const size_t count)
{
    double median_foo, median_bar;
    if (count < 1)
        return;

    printf("Median wall clock times over %zu calls:\n", count);
    fflush(stdout);

    median_foo = median_seconds(count, foo);
    median_bar = median_seconds(count, bar);

    printf("foo(): %.3f ns\n", median_foo * 1000000000.0);
    printf("bar(): %.3f ns\n", median_bar * 1000000000.0);

    printf("(Measurement unit is approximately %.3f ns)\n", 1000000000.0 * realtime_precision());
    fflush(stdout);
}

 

In general, I personally prefer to compile the benchmarked function in a separate unit (to a separate object file), and also to benchmark a do-nothing function to estimate the function call overhead. This tends to overestimate the overhead, because some of the function call overhead is latency rather than actual time taken, and in the real functions some operations can proceed during those latencies.
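The overhead estimate could be sketched roughly as follows. This is only an illustration under some assumptions: do_nothing(), median_call_seconds() and net_seconds() are hypothetical names, and instead of a separate object file the sketch calls through a volatile function pointer to keep the compiler from inlining the call. The measured overhead also includes the cost of the clock_gettime() calls themselves.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Stand-in for a do-nothing function compiled in a separate unit. */
void do_nothing(void)
{
}

static int cmp_seconds(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

/* Median wall clock seconds per call of func over count calls.
   Returns a negative value on failure. */
double median_call_seconds(size_t count, void (*func)(void))
{
    /* Volatile pointer: the call stays a real, indirect call. */
    void (*volatile call)(void) = func;
    struct timespec started, stopped;
    double         *seconds, median;
    size_t          i;

    if (count < 1)
        return -1.0;

    seconds = malloc(count * sizeof seconds[0]);
    if (!seconds)
        return -1.0;

    for (i = 0; i < count; i++) {
        clock_gettime(CLOCK_MONOTONIC, &started);
        call();
        clock_gettime(CLOCK_MONOTONIC, &stopped);
        seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
                   + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
    }

    qsort(seconds, count, sizeof seconds[0], cmp_seconds);
    median = seconds[count / 2];
    free(seconds);
    return median;
}

/* Subtract the estimated overhead from a measurement, clamping at zero
   because the overhead estimate tends to be too large. */
double net_seconds(double measured, double overhead)
{
    return (measured > overhead) ? measured - overhead : 0.0;
}
```

A caller would measure median_call_seconds(count, do_nothing) once, then pass it as the overhead to net_seconds() for each benchmarked function. The clamp reflects that the subtraction can go negative for very short functions.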

It is important to remember that the above measurements should only be used as indications, because in a real-world application, things like cache locality (especially on current machines, with multi-level caching and lots of memory) hugely affect the time used by different implementations.

For example, you might compare the speeds of a quicksort and a radix sort. Depending on the size of the keys, the radix sort requires rather large extra arrays (and uses a lot of cache). If the real application the sort routine is used in does not simultaneously use a lot of other memory (so that the sorted data is basically what is cached), then a radix sort will be faster if there is enough data (and the implementation is sane). However, if the application is multithreaded, and the other threads shuffle (copy or transfer) a lot of memory around, then the radix sort, using a lot of cache, will evict the other threads' cached data. Even though the radix sort function itself does not show any serious slowdown, it may slow down the other threads, and therefore the overall program, because the other threads have to wait for their data to be re-cached.

This means that the only "benchmarks" you should trust are wall clock measurements made on the actual hardware, running actual work tasks with actual work data. Everything else is subject to many conditions, and is more or less suspect: an indication, yes, but not a very reliable one.
