Is Intel's timestamp reading asm code example using two more registers than are necessary?

Question

I'm looking into measuring benchmark performance using the time-stamp register (TSR) found in x86 CPUs. It's a useful register, since it measures in a monotonic unit of time which is immune to the clock speed changing. Very cool.

Here is an Intel document showing asm snippets for reliably benchmarking using the TSR, including using cpuid for pipeline synchronisation. See page 16:

http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html

To read the start time, it says (I annotated a bit):

__asm volatile (
    "cpuid\n\t"             // writes e[abcd]x
    "rdtsc\n\t"             // writes edx, eax
    "mov %%edx, %0\n\t" 
    "mov %%eax, %1\n\t"
    //
    :"=r" (cycles_high), "=r" (cycles_low)  // outputs
    :                                       // inputs
    :"%rax", "%rbx", "%rcx", "%rdx");       // clobber

I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the TSR value right out of edx and eax? Like this:

__asm volatile(                                                             
    "cpuid\n\t"
    "rdtsc\n\t"
    //
    : "=d" (cycles_high), "=a" (cycles_low) // outputs
    :                                       // inputs
    : "%rbx", "%rcx");                      // clobber     

By doing this, you save two registers, reducing the likelihood of the C compiler needing to spill.

Am I right? Or are those MOVs somehow strategic?

(I agree that you do need scratch registers to read the stop time, as in that scenario the order of the instructions is reversed: you have rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
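
For reference, the stop-time read described in that parenthetical could look roughly like the sketch below. It assumes x86-64, GNU C inline asm, and a CPU that supports rdtscp; the function name tsc_stop and the explicit zeroing of eax before cpuid are illustrative choices, not taken from the Intel paper. The movs are needed here because cpuid overwrites eax and edx after rdtscp has filled them.

```c
#include <stdint.h>

/* Hypothetical helper: read the TSC at the end of a timed region. */
static inline uint64_t tsc_stop(void)
{
    uint32_t hi, lo;
    __asm__ volatile(
        "rdtscp\n\t"            /* writes edx:eax (TSC) and ecx (IA32_TSC_AUX) */
        "mov %%edx, %0\n\t"     /* copy the result out before cpuid overwrites it */
        "mov %%eax, %1\n\t"
        "xor %%eax, %%eax\n\t"  /* assumption: request cpuid leaf 0 explicitly */
        "cpuid"                 /* serialize; writes e[abcd]x */
        : "=r"(hi), "=r"(lo)                         /* outputs */
        :                                            /* inputs */
        : "rax", "rbx", "rcx", "rdx", "cc", "memory" /* clobbers */
    );
    return ((uint64_t)hi << 32) | lo;
}
```

Because all four cpuid registers are in the clobber list, the compiler has to place the two outputs somewhere else, which is exactly the job the movs do.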

Thanks

Answer

You're correct, the example is clunky. Usually if mov is the first or last instruction in an inline-asm statement, you're doing it wrong, and should have used a constraint to tell the compiler where you want the input, or where the output is.
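
As a minimal sketch of that constraint-based approach (assuming x86-64 and GNU C; the function name tsc_start and the choice of cpuid leaf 0 are illustrative), the start-time read can let the "a" and "d" constraints name the registers directly and combine the halves in C:

```c
#include <stdint.h>

/* Hypothetical helper: serialize with cpuid, then read the TSC. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "cpuid\n\t"             /* serialize; writes e[abcd]x */
        "rdtsc"                 /* edx:eax = time-stamp counter */
        : "=a"(lo), "=d"(hi)    /* outputs: taken straight from eax / edx */
        : "a"(0)                /* input: cpuid leaf 0 (assumption) */
        : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

No mov is needed: the compiler already knows the results are in eax and edx, and only rbx and rcx have to be listed as clobbers. The "memory" clobber is an extra here that keeps the compiler from moving memory accesses across the timestamp read.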

See my GNU C inline asm guides / links collection, and other links in the inline-assembly tag wiki. (The x86 tag wiki is full of good stuff for asm in general, too.)

Or for rdtsc specifically, see Get CPU cycle count? for the __rdtsc() intrinsic, and good inline asm in @Mysticial's answer.
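
A short sketch of the intrinsic route, assuming GCC or Clang on x86 (MSVC declares __rdtsc() in <intrin.h> instead); the function time_loop and its busy loop are made up purely to have something to time:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Hypothetical example: TSC ticks spent in a small loop (no serialization). */
static uint64_t time_loop(unsigned n)
{
    volatile unsigned sink = 0;      /* volatile so the loop is not optimized away */
    uint64_t t0 = __rdtsc();
    for (unsigned i = 0; i < n; i++)
        sink += i;
    return __rdtsc() - t0;           /* elapsed reference ticks, not core cycles */
}
```

With the intrinsic there are no constraints or clobbers to get wrong, and the compiler handles the edx:eax pair itself.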

"it measures in a monotonic unit of time which is immune to the clock speed changing."

Yes, on CPUs made within the last 10 years or so.

For profiling, it's often more useful to have times in core clock cycles, not wall-clock time, so your microbenchmark results don't depend on power-saving / turbo. Performance counters can do this and much more.

Still, if real time is what you want, rdtsc is the lowest-overhead way to get it.

And re: discussion in comments: yes cpuid is there to serialize, making sure that rdtsc and following instructions can't begin executing until after CPUID. You could put another CPUID after RDTSC, but that would increase measurement overhead, and I think give near-zero gain in accuracy / precision.

LFENCE is a cheaper alternative that's useful with RDTSC. The instruction ref manual entry documents the fact that it doesn't let later instructions start executing until it and previous instructions have retired (from the ROB/RS in the out-of-order part of the core). See Are loads and stores the only instructions that gets reordered?, and for a specific example of using it, see clflush to invalidate cache line via C function. Unlike true serializing instructions like cpuid, it doesn't flush the store buffer.

(On recent AMD CPUs without Spectre mitigation enabled, lfence is not even partially serializing, and runs at 4 per clock according to Agner Fog's testing. Is LFENCE serializing on AMD processors?)

Margaret Bloom dug up this useful link, which also confirms that LFENCE serializes RDTSC according to Intel's SDM, and has some other stuff about how to do serialization around RDTSC.
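
A hedged sketch of the LFENCE variant, assuming x86-64 and GNU C (the function name tsc_lfence is illustrative). On CPUs where lfence waits for earlier instructions to retire, this keeps rdtsc from starting early without touching ebx or ecx:

```c
#include <stdint.h>

/* Hypothetical helper: lfence, then read the TSC. */
static inline uint64_t tsc_lfence(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "lfence\n\t"            /* earlier instructions finish before rdtsc starts */
        "rdtsc"                 /* edx:eax = time-stamp counter */
        : "=a"(lo), "=d"(hi)    /* outputs */
        :                       /* inputs */
        : "memory");            /* also stops the compiler reordering memory ops */
    return ((uint64_t)hi << 32) | lo;
}
```

Compared with the cpuid version, nothing besides eax and edx is overwritten, so the clobber list shrinks to the compiler barrier alone.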
