Is there any difference between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?


Question

As far as I know, the main difference in runtime ordering between the rdtsc and rdtscp instructions is whether execution waits until all previous instructions have executed locally.

In other words, lfence + rdtsc = rdtscp, because an lfence preceding rdtsc makes the rdtsc execute only after all previous instructions have finished locally.

However, I've seen some example code that uses rdtsc at the start of the measurement and rdtscp at the end. Is there any difference between using two rdtsc instructions and using rdtsc + rdtscp?

    lfence
    rdtsc
    lfence
    ...
    ...
    ...
    lfence
    rdtsc
    lfence



    lfence
    rdtsc
    lfence
    ...
    ...
    ...
    rdtscp
    lfence


Recommended answer

TL;DR

rdtscp and lfence/rdtsc have exactly the same upstream serialization properties on Intel processors. On AMD processors with a dispatch-serializing lfence, both sequences also have the same upstream serialization properties. With respect to later instructions, the rdtsc in the lfence/rdtsc sequence may be dispatched for execution simultaneously with later instructions. This behavior may be undesirable if you also want to time those later instructions precisely. It is generally not a problem, because the reservation station (RS) scheduler prioritizes older uops for dispatch as long as there are no structural hazards. After lfence retires, the rdtsc uops would be the oldest in the RS, probably with no structural hazards, so they will be dispatched immediately (possibly together with some later uops). You could also put an lfence after rdtsc.
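As a minimal sketch of the two closing-timestamp sequences compared above, the following C uses the GCC/Clang intrinsics `__rdtsc`, `__rdtscp` and `_mm_lfence` from `<x86intrin.h>`. It assumes an x86-64 target with rdtscp support; the helper names `tsc_stop_lfence_rdtsc` and `tsc_stop_rdtscp` are made up for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Closing timestamp via lfence + rdtsc: the lfence waits for earlier
   instructions to complete locally before rdtsc executes. */
static inline uint64_t tsc_stop_lfence_rdtsc(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();   /* optional: keeps later instructions behind the read */
    return t;
}

/* Closing timestamp via rdtscp: the instruction itself waits for earlier
   instructions. It also stores IA32_TSC_AUX in *aux (clobbering ECX). */
static inline uint64_t tsc_stop_rdtscp(unsigned int *aux)
{
    uint64_t t = __rdtscp(aux);
    _mm_lfence();   /* optional: keeps later instructions behind the read */
    return t;
}
```

Both helpers return the raw 64-bit counter; the difference of two such values is an elapsed time in TSC ticks, not necessarily in core clock cycles.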

The Intel manual V2 says the following about rdtscp:


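"The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed."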

The "read operation" part here refers to reading the time-stamp counter. This suggests that rdtscp internally works like lfence followed by rdtsc plus a read of IA32_TSC_AUX. That is, the lfence is performed first, then the two register reads are executed (possibly at the same time).

On most Intel and AMD processors that support these instructions, lfence/rdtsc has a slightly larger number of uops than rdtscp. The number of lfence uops listed in Agner's tables is for the case where lfence instructions are executed back-to-back, which makes it appear that lfence decodes into fewer uops (1 or 2) than a single lfence actually does (5 or 6 uops). Usually, lfence is used without other back-to-back lfences, which is why lfence/rdtsc contains more uops than rdtscp. Agner's tables also show that on some processors rdtsc and rdtscp have the same number of uops, which I'm not sure is correct. It would make more sense for rdtscp to have one or more uops more than rdtsc. That said, latency may matter more than the difference in uop count, because latency is what directly affects the measurement overhead.
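Since the latency of the timing sequence feeds directly into the measurement, one common way to account for it is to time an empty region many times and keep the minimum. The sketch below does that with the GCC/Clang intrinsics; `tsc_overhead_min` is a hypothetical helper name, and an x86-64 target with rdtscp support is assumed.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Returns the smallest observed tick count for an empty timed region,
   i.e. an estimate of the fixed cost of the timing sequence itself.
   Returns UINT64_MAX if trials <= 0. */
static uint64_t tsc_overhead_min(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        unsigned int aux;
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        /* empty region: nothing is being measured here */
        uint64_t t1 = __rdtscp(&aux);
        _mm_lfence();
        (void)aux;
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

The returned value can be subtracted from later measurements taken with the same sequence. Taking the minimum rather than the mean is a design choice that discards samples disturbed by interrupts or cache misses.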

In terms of portability, rdtsc is older than rdtscp: rdtsc was first supported on the Pentium processors, while the first processors to support rdtscp were released in 2005-2006 (see: "What is the gcc cpu-type that includes support for RDTSCP?"). But most Intel and AMD processors in use today support rdtscp. Another dimension for comparing the two sequences is that rdtscp clobbers one more register (namely ECX) than rdtsc.

In summary, if you don't care about reading the IA32_TSC_AUX MSR, there is no particularly strong reason to choose one over the other. I would use rdtscp and fall back to lfence/rdtsc (or lfence/rdtsc/lfence) on processors that don't support it. If you want maximum timing precision, use the method discussed in "Memory latency measurement with time stamp counter".
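The fallback described above could be sketched as follows. RDTSCP support is reported in CPUID leaf 0x80000001, EDX bit 27; the sketch queries it through the GCC/Clang helper `__get_cpuid` from `<cpuid.h>`. The function names `has_rdtscp` and `tsc_stop` are hypothetical, and an x86-64 target is assumed.

```c
#include <cpuid.h>      /* __get_cpuid (GCC/Clang helper) */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* Returns 1 if CPUID.80000001H:EDX[27] (RDTSCP) is set, else 0. */
static int has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;   /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Closing timestamp: rdtscp when available, lfence + rdtsc otherwise. */
static uint64_t tsc_stop(int use_rdtscp)
{
    uint64_t t;
    if (use_rdtscp) {
        unsigned int aux;
        t = __rdtscp(&aux);
        (void)aux;
    } else {
        _mm_lfence();
        t = __rdtsc();
    }
    _mm_lfence();   /* order the read before later instructions */
    return t;
}
```

In practice one would call `has_rdtscp()` once at startup and cache the result, so that the CPUID instruction (which is itself serializing and slow) stays out of the timed path.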

As Andreas Abel pointed out, you still need an lfence after the last rdtsc(p), as it is not ordered with respect to subsequent instructions:

    lfence                    lfence
    rdtsc      -- ALLOWED --> B
    B                         rdtsc

    rdtscp     -- ALLOWED --> B
    B                         rdtscp

This is also addressed in the manuals.

Regarding the use of rdtscp, it seems correct to me to think of it as a compact lfence + rdtsc.
The manuals use different terminology for the two instructions (e.g. "completed locally" vs. "globally visible" for loads), but the behavior described seems to be the same.
I'm assuming so in the rest of this answer.

However, rdtscp is a single instruction, while lfence + rdtsc are two, which makes the lfence part of the profiled code.
Granted, lfence should be lightweight in terms of back-end execution resources (it is just a marker), but it still occupies front-end resources (two uops?) and a slot in the ROB.
rdtscp is decoded into a greater number of uops because it also reads IA32_TSC_AUX, so while it saves (part of) the front-end resources, it occupies the back end more.
If the TSC is read before (or concurrently with) the processor ID, these extra uops are relevant only to the subsequent code.
This could be a reason why it is used at the end but not at the start of the benchmark (where the extra uops would affect the profiled code). This is enough to bias or complicate some micro-architectural benchmarks.

You cannot avoid the lfence after an rdtsc(p), but you can avoid the one before it by using rdtscp.
This seems unnecessary for the first rdtsc, as the preceding lfence is not profiled anyway.
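Putting these points together, a timed region might be laid out as in the sketch below: lfence/rdtsc/lfence at the start, rdtscp/lfence at the end. It assumes GCC or Clang on x86-64 with rdtscp support; `time_region` and the workload `spin` are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Times one call of fn(arg) in TSC ticks. */
static uint64_t time_region(void (*fn)(void *), void *arg)
{
    unsigned int aux;

    _mm_lfence();                    /* not profiled: waits for earlier work */
    uint64_t start = __rdtsc();
    _mm_lfence();                    /* the region cannot start before the read */

    fn(arg);                         /* profiled code */

    uint64_t stop = __rdtscp(&aux);  /* waits for the region; no lfence before it */
    _mm_lfence();                    /* later code cannot start before the read */

    (void)aux;
    return stop - start;
}

/* Hypothetical workload: counts a volatile counter down to zero. */
static void spin(void *arg)
{
    volatile unsigned int *counter = arg;
    while (*counter > 0)
        *counter = *counter - 1;
}
```

A call such as `time_region(spin, &n)` returns the elapsed ticks for one run. Note that the compiler may still move ordinary computations across the intrinsics, so the profiled code is placed behind a function pointer here to keep it between the two reads.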

Another reason to use rdtscp at the end is that it was (according to Intel) meant to detect a migration to a different CPU (that's why it also atomically loads IA32_TSC_AUX), so at the end of the profiled code you may want to check that the code has not been scheduled onto another CPU.


"User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC."

This, of course, requires having read IA32_TSC_AUX beforehand (to have something to compare against), so one should execute an rdpid or rdtscp before the profiled code.
If one can afford not to use ecx, the first rdtsc can be an rdtscp too (but see above); otherwise (rather than storing the processor ID while in the profiled code), rdpid can be used first (thus leaving an rdtsc + rdtscp pair around the profiled code).
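A minimal sketch of the migration check, using rdtscp for both reads and comparing the two IA32_TSC_AUX values. The contents of IA32_TSC_AUX are set by the operating system, so treating it as a CPU identifier is an assumption about the platform; the names `tsc_sample`, `time_with_migration_check` and `nop_work` are hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

struct tsc_sample {
    uint64_t ticks;   /* elapsed TSC ticks */
    int migrated;     /* 1 if IA32_TSC_AUX differed between the two reads */
};

static struct tsc_sample time_with_migration_check(void (*fn)(void))
{
    unsigned int aux_before, aux_after;
    struct tsc_sample s;

    uint64_t start = __rdtscp(&aux_before);  /* TSC and AUX read together */
    _mm_lfence();
    fn();                                    /* profiled code */
    uint64_t stop = __rdtscp(&aux_after);
    _mm_lfence();

    s.ticks = stop - start;
    s.migrated = (aux_before != aux_after);
    return s;
}

/* Hypothetical empty workload. */
static void nop_work(void) { }
```

A caller could simply discard and repeat any sample with `migrated` set. This check only compares the endpoints, so it does not notice a thread that moved away and came back.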

This is open to the ABA problem, so I don't think Intel has a strong point here (unless we restrict ourselves to code short enough to be rescheduled at most once).

EDIT: As PeterCordes pointed out, from the point of view of measuring elapsed time, a migration A->B->A is not an issue, as the reference clock is the same.

More information on why rdtsc(p) is not fully serializing: "Why isn't RDTSC a serializing instruction?".
