Recommended answer
A full discussion of the overhead you're seeing from the cpuid instruction is available at this Stack Overflow thread. When using rdtsc, you need to issue cpuid first: cpuid is a serializing instruction, so it ensures that no earlier instructions are still in flight when the counter is read. The rdtscp instruction waits for all preceding instructions to execute on its own, so it needs no cpuid in front of it. (The referenced SO thread also discusses these points, but I cover them here because they're part of your question as well.)
You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.
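The choice described above can be sketched in C. This is a minimal sketch, assuming GCC or Clang on x86-64, where `<cpuid.h>` provides `__get_cpuid` and `__cpuid` and `<x86intrin.h>` provides `__rdtsc` and `__rdtscp`. The helper names (`has_rdtscp`, `read_tsc_cpuid`, `read_tsc_rdtscp`, `read_tsc`) are made up for illustration and are not part of the original answer. RDTSCP support is reported in bit 27 of EDX for CPUID leaf 0x80000001.

```c
#include <cpuid.h>
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical helper: returns 1 if the CPU reports RDTSCP support
 * (CPUID leaf 0x80000001, EDX bit 27), 0 otherwise. */
static int has_rdtscp(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0; /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Fallback path: cpuid serializes execution, then rdtsc reads the counter. */
static uint64_t read_tsc_cpuid(void) {
    unsigned int eax, ebx, ecx, edx;
    __cpuid(0, eax, ebx, ecx, edx);
    (void)eax; (void)ebx; (void)ecx; (void)edx; /* results are not needed */
    return __rdtsc();
}

/* Preferred path: rdtscp waits for earlier instructions by itself. */
static uint64_t read_tsc_rdtscp(void) {
    unsigned int aux; /* receives IA32_TSC_AUX; unused in this sketch */
    return __rdtscp(&aux);
}

/* Hypothetical wrapper that picks the path based on CPU support. */
static uint64_t read_tsc(void) {
    return has_rdtscp() ? read_tsc_rdtscp() : read_tsc_cpuid();
}
```

In real measurement code you would probably check `has_rdtscp()` once at startup and cache the result, since the check itself executes cpuid and would otherwise add overhead to every reading.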
Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:
uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();
atomic_add(e - s, &acc);
atomic_add(1, &counter);
You may still have an off-by-one in your average measurement depending on where your read happens. For instance:
      T1                            T2
t0    atomic_add(e - s, &acc);
t1                                  a = atomic_read(&acc);
t2                                  c = atomic_read(&counter);
t3    atomic_add(1, &counter);
t4                                  avg = a / c;
It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
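One way to compute the average in-line with each delta is a cumulative running mean, so a reader needs only one value instead of a separately updated sum and count. The following is a minimal sketch, not the original author's code: the `running_avg` struct and `running_avg_add` function are hypothetical names, and it assumes updates are serialized (a single writer, or callers holding a lock).

```c
#include <stdint.h>

/* Hypothetical accumulator: the mean is brought up to date on every sample,
 * so there is no separate sum/count pair to read inconsistently. */
struct running_avg {
    uint64_t count; /* number of samples folded in so far */
    double mean;    /* current average of all samples */
};

/* Fold one measured delta (e - s) into the running mean.
 * Assumption: calls are serialized by the caller. */
static void running_avg_add(struct running_avg *r, uint64_t delta) {
    r->count += 1;
    r->mean += ((double)delta - r->mean) / (double)r->count;
}
```

With this shape, the measurement site would call `running_avg_add(&avg, e - s)` in place of the two separate `atomic_add` calls, under whatever synchronization the surrounding code already uses.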
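Side points:
- If you do use cpuid+rdtsc, you need to subtract out the overhead of the cpuid instruction, which may be hard to determine if you're in a VM (depending on how the VM implements this instruction). This is why you should stick with rdtscp.
- Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like: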
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
s = rdtscp();
loop_body();
e = rdtscp();
acc += e - s;
}
printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine. Doing it the way shown above is also bad because you end up with 2 pipeline stalls per loop iteration. Thus:
s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
This will be more efficient, and probably closer to what you'll see in real life, than the previous benchmark.
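For reference, the whole-loop pattern can be written as compilable C. This is a minimal sketch, assuming GCC or Clang on x86-64 with `__rdtscp` from `<x86intrin.h>`. The `loop_body` here is a hypothetical stand-in that bumps a volatile counter so the compiler cannot remove it, and `ticks_per_iteration` is an illustrative name. It reports raw counter ticks per iteration and leaves out the CLOCK_SPEED conversion.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Volatile so the compiler cannot optimize the placeholder work away. */
static volatile uint64_t sink;

/* Hypothetical stand-in for the code under test. */
static void loop_body(void) {
    sink += 1;
}

/* Read the counter once before and once after the whole loop, so the
 * timing reads do not interrupt the pipeline on every iteration. */
static double ticks_per_iteration(uint64_t iterations) {
    unsigned int aux; /* receives IA32_TSC_AUX; unused in this sketch */
    uint64_t s = __rdtscp(&aux);
    for (uint64_t i = 0; i < iterations; i++) {
        loop_body();
    }
    uint64_t e = __rdtscp(&aux);
    return (double)(e - s) / (double)iterations;
}
```

Calling `ticks_per_iteration(1000000)` would give an average that includes the loop overhead, which the answer above argues is acceptable in a microbenchmark.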