RDTSCP in NASM always returns the same value (timing a single instruction)


Question

I am using RDTSC and RDTSCP in NASM to measure machine cycles for various assembly language instructions to help in optimization.

I read "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures" by Gabriele Paoloni at Intel (September 2010) and other web resources (most of which were examples in C).

Using the code below (translated from C), I test various instructions, but RDTSCP always returns zero in RDX and 7 in RAX. I first thought 7 is the number of cycles, but obviously not all instructions take 7 cycles.

rdtsc
cpuid
addsd xmm14,xmm1 ; Instruction to time
rdtscp
cpuid

This returns 7, which is not surprising because on some architectures addsd is 7 cycles with latency included. The first two instructions can (according to some) be reversed, cpuid first then rdtsc, but that makes no difference here.

When I change the instruction to a 2-cycle instruction:

rdtsc
cpuid
add rcx,rdx ; Instruction to time
rdtscp
cpuid

This also returns 7 in rax and zero in rdx.

So my questions are:

  1. How do I access and interpret the values returned in RDX:RAX?

  2. Why does RDX always return zero, and what is it supposed to return?

Update:

If I change the code to this:

cpuid
rdtsc
mov [start_time],rax
addsd xmm14,xmm1 ; INSTRUCTION
rdtscp
mov [end_time],rax
cpuid
mov rax,[end_time]
mov rdx,[start_time]
sub rax,rdx

I get 64 in rax, but that sounds like too many cycles.

Answer

Your first code (leading to the title question) is buggy because it overwrites the rdtsc and rdtscp results with the cpuid results in EAX, EBX, ECX, and EDX.

Use lfence instead of cpuid; on Intel since forever, and on AMD with Spectre mitigation enabled, lfence will serialize the instruction stream and thus do what you want with rdtsc.
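
For example, here is a minimal sketch of my own (not code from the original answer; the scratch registers r8/r9 are arbitrary choices) of lfence-fenced timing of a single instruction, which also shows how to combine the EDX:EAX halves into one 64-bit value (question 1):

lfence                  ; wait for earlier instructions to complete
rdtsc                   ; EDX:EAX = start timestamp (high:low 32 bits)
lfence                  ; keep the timed instruction from starting early
mov   r8d, eax          ; save the start TSC halves in scratch registers
mov   r9d, edx

addsd xmm14, xmm1       ; instruction under test

lfence                  ; wait for the timed instruction to complete
rdtsc                   ; EDX:EAX = end timestamp
shl   rdx, 32
or    rax, rdx          ; RAX = 64-bit end TSC
shl   r9, 32
or    r8, r9            ; R8  = 64-bit start TSC
sub   rax, r8           ; RAX = elapsed reference cycles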

Remember that RDTSC counts reference cycles, not core clock cycles. See "Get CPU cycle count?" for that and more about RDTSC.

You don't have cpuid or lfence inside your measurement interval, but you do have rdtscp itself in the measurement interval. Back-to-back rdtscp is not fast; 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. The idle clock speed is usually a lot slower than the reference frequency; one reference cycle is equal (or close) to the "sticker" frequency, i.e. the max non-turbo sustained frequency on Intel CPUs, e.g. 4008 MHz on a "4GHz" Skylake CPU.

What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

You can try to subtract measurement overhead, though; see e.g. "clflush to invalidate cache line via C function", and also the follow-ups "Using time stamp counter and clock_gettime for cache miss" and "Memory latency measurement with time stamp counter".
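
For instance, here is a sketch of the subtraction idea under the same lfence assumptions (mine, not from the original answer): time an empty interval once, and subtract that constant from later measurements.

lfence
rdtsc
mov   r8d, eax          ; start (low 32 bits suffice for a short interval)
lfence
                        ; empty measurement interval: nothing under test
lfence
rdtsc
sub   eax, r8d          ; EAX ~= fixed overhead of the timing harness itself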

This is what I usually use to profile the latency or throughput (and uops, fused and unfused domain) of an instruction or short block. Adjust how you use it to bottleneck on latency, like here, or not, if you want to just test throughput: e.g. use a %rep block with enough different registers to hide latency, or break dependency chains with a pxor xmm3, xmm3 after a short block and let out-of-order exec work its magic. (As long as you don't bottleneck on the front-end.)
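
As a sketch of that throughput variant (my illustration; register choices are arbitrary), the loop body could zero the addsd destination each iteration, so the short 4-add chains from successive iterations can overlap in the out-of-order back-end:

.loop:
    times 4   addsd  xmm4, xmm3
    pxor  xmm4, xmm4         ; zero the destination: breaks the loop-carried
                             ; dependency chain between iterations
    dec   ecx
    jnz   .loop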

You might want to use NASM's smartalign package, or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.
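
For example (a small sketch using NASM's standard smartalign directives):

%use smartalign
alignmode p6, 32             ; long-NOP padding instead of single-byte 0x90s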

global _start
_start:
    mov   ecx, 1000000000
; linux static executables start with XMM0..15 already zeroed
align 32                     ; just for good measure to avoid uop-cache effects
.loop:
    ;; LOOP BODY, put whatever you want to time in here
    times 4   addsd  xmm4, xmm3

    dec   ecx
    jnz   .loop

    mov  eax, 231
    xor  edi, edi
    syscall          ; x86-64 Linux sys_exit_group(0)

Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

(I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
 taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"

Result from i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column. It's fixed upstream but Arch Linux hasn't updated yet.):

 Performance counter stats for './testloop' (4 runs):

          4,106.09 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                17      context-switches          #    4.080 M/sec                    ( +-  5.65% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.487 M/sec                  
    16,012,778,144      cycles                    # 3900323.504 GHz                   ( +-  0.01% )
     1,001,537,894      branches                  # 243950284.862 M/sec               ( +-  0.00% )
     6,008,071,198      instructions              #    0.38  insn per cycle           ( +-  0.00% )
     5,013,366,769      uops_issued.any           # 1221134275.667 M/sec              ( +-  0.01% )
     5,013,217,655      uops_executed.thread      # 1221097955.182 M/sec              ( +-  0.01% )

          4.106283 +- 0.000536 seconds time elapsed  ( +-  0.01% )

On my i7-6700k (Skylake), addsd has 4-cycle latency and 0.5c throughput (i.e. 2 per clock, if latency weren't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

16 cycles per branch = 16 cycles per chain of 4 addsd = 4-cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100, even though this test includes a tiny bit of startup overhead and interrupt overhead.
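
Spelled out with the perf numbers above: 16,012,778,144 cycles / 1,001,537,894 branches ≈ 15.99 cycles per iteration, and with 4 serially dependent addsd instructions per iteration that gives ≈ 4.0 cycles of latency per addsd.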

Take your pick of different counters to record. Adding a :u suffix, like instructions:u, to a perf event will count only user-space instructions, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for the wall-clock time. But if you do, cycles:u can match very closely with instructions:u.

-r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

Adjust your initial ECX value to make the total time about 0.1 to 1 second; that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference), or if you run at max non-turbo with turbo disabled.
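
As a sanity check against the run above: 10^9 iterations × ~16 cycles ≈ 1.6×10^10 core cycles, which at 3.9GHz is about 4.1 seconds, matching the measured 4.106 s; an initial ECX of 100000000 would bring the run down to roughly 0.4 seconds.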

But this counts in core clock cycles, not reference cycles, so it still gives the same result regardless of CPU frequency changes. (+- some noise from stopping the clock during the transition.)
