RDTSCP versus RDTSC + CPUID


Question

I'm doing some Linux kernel timings, specifically in the interrupt handling path. I've been using RDTSC for timings, but I recently learned that it's not necessarily accurate, because instructions can execute out of order.

I then tried:

  1. RDTSC + CPUID (in reverse order, here) to flush the pipeline, and incurred up to a 60x overhead (!) on a virtual machine (my working environment) due to hypercalls and whatnot. This happens both with and without HW virtualization enabled.

Most recently I've come across the RDTSCP instruction, which seems to do what RDTSC + CPUID did, but more efficiently as it's a newer instruction: only a 1.5x-2x overhead, relatively.

My question: is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?

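Also, to be more clear, my timing is essentially like this, internally:

  • Save the current cycle counter value
  • Perform one type of benchmark (i.e.: disk, network)
  • For each individual interrupt, add the delta between the current and previous cycle counter values to an accumulator and increment a counter
  • At the end, divide the accumulator by the number of interrupts to get the average cycle cost per interrupt.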

Recommended answer

A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need to use cpuid to ensure that no earlier instructions are still in the execution pipeline. The rdtscp instruction waits for all preceding instructions to execute before it reads the counter, so it needs no cpuid in front of it. (The referenced SO thread also discusses these salient points, but I address them here because they're part of your question as well.)

You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and it will accurately give you the information you are after.
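As a rough sketch of that choice, the snippet below checks the RDTSCP feature bit (CPUID leaf 0x80000001, EDX bit 27) and reads the counter through the compiler intrinsic. It assumes GCC or Clang on x86-64 in user space; the helper names has_rdtscp and read_tsc_rdtscp are made up for illustration, and kernel code would normally use the kernel's own helpers instead.

```c
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp */

/* Hypothetical helper: nonzero if the CPU advertises RDTSCP
 * (CPUID.80000001H:EDX bit 27). */
static inline int has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;               /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Hypothetical helper: read the time-stamp counter with rdtscp.
 * Only call this after has_rdtscp() has returned nonzero. */
static inline uint64_t read_tsc_rdtscp(void)
{
    unsigned int aux;           /* receives IA32_TSC_AUX; unused here */

    return __rdtscp(&aux);
}
```

Checking the feature bit once at start-up and caching the result keeps the cpuid cost out of the timed path.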

Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:

uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();

atomic_add(e - s, &acc);
atomic_add(1, &counter);

You may still have an off-by-one in your average measurement depending on where your read happens, because a reader can see the newest delta in acc before counter has been incremented. For instance:

   T1                              T2
t0 atomic_add(e - s, &acc);
t1                                 a = atomic_read(&acc);
t2                                 c = atomic_read(&counter);
t3 atomic_add(1, &counter);
t4                                 avg = a / c;

It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
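One way to picture the running-average suggestion is the incremental mean update below. This is only a sketch: the struct and function names are invented, it assumes a single writer (for example one instance per CPU), and it uses a double for brevity, whereas kernel code generally avoids floating point and would need a fixed-point variant.

```c
#include <stdint.h>

/* Hypothetical accumulator: the mean is updated together with the
 * count, so a reader never combines a new sum with a stale count. */
struct running_avg {
    uint64_t count;   /* number of samples folded in so far */
    double   mean;    /* current average of all samples */
};

/* Fold one delta into the average:
 * mean_n = mean_(n-1) + (delta - mean_(n-1)) / n */
static inline void running_avg_add(struct running_avg *ra, uint64_t delta)
{
    ra->count++;
    ra->mean += ((double)delta - ra->mean) / (double)ra->count;
}
```

A reader then only needs ra.mean, rather than dividing two values that were read at different times.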

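Side points:

  1. If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
  2. Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like: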

for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
   s = rdtscp();
   loop_body();
   e = rdtscp();
   acc += e - s;
}

printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so including the loop overhead in the measurement is fine. Doing it the way shown above is also bad because you end up with 2 pipeline stalls per loop iteration. Thus:

s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
   loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

This will be more efficient, and probably more accurate in terms of what you'll see in Real Life, than the previous benchmark.
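For a self-contained illustration of timing the whole loop, the sketch below also estimates the cost of the two counter reads and subtracts it. It assumes GCC or Clang on an x86-64 CPU that supports rdtscp; the loop_body here is a trivial stand-in, and timer_overhead and ticks_per_iteration are hypothetical names.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp */

/* Stand-in workload; volatile keeps the compiler from deleting the loop. */
static volatile uint64_t sink;
static inline void loop_body(void) { sink = sink + 1; }

/* Smallest observed distance between two back-to-back rdtscp reads. */
static inline uint64_t timer_overhead(void)
{
    unsigned int aux;
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < 1000; i++) {
        uint64_t s = __rdtscp(&aux);
        uint64_t e = __rdtscp(&aux);
        if (e - s < best)
            best = e - s;
    }
    return best;
}

/* Time the whole loop once and return the average ticks per iteration. */
static inline uint64_t ticks_per_iteration(uint64_t iterations)
{
    unsigned int aux;
    uint64_t overhead, s, e, total;

    if (iterations == 0)
        return 0;
    overhead = timer_overhead();
    s = __rdtscp(&aux);
    for (uint64_t i = 0; i < iterations; i++)
        loop_body();
    e = __rdtscp(&aux);
    total = e - s;
    if (total > overhead)
        total -= overhead;      /* remove the cost of the reads themselves */
    return total / iterations;
}
```

Taking the minimum of many back-to-back reads is one common way to estimate the read cost; inside a VM that estimate can still be noisy.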
