RDTSCP in NASM always returns the same value

Problem description

I am using RDTSC and RDTSCP in NASM to measure machine cycles for various assembly language instructions to help in optimization.

I read "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures" by Gabriele Paoloni at Intel (September 2010) and other web resources (most of which were examples in C).

Using the code below (translated from C), I test various instructions, but RDTSCP always returns zero in RDX and 7 in RAX. I first thought 7 is the number of cycles, but obviously not all instructions take 7 cycles.

rdtsc
cpuid
addsd xmm14,xmm1 ; Instruction to time
rdtscp
cpuid

This returns 7, which is not surprising because on some architectures addsd is 7 cycles with latency included. The first two instructions can (according to some) be reversed, cpuid first then rdtsc, but that makes no difference here.

When I change the instruction to a 2-cycle instruction:

rdtsc
cpuid
add rcx,rdx ; Instruction to time
rdtscp
cpuid

This also returns 7 in rax and zero in rdx.

So my questions are:

  1. How do I access and interpret the values returned in RDX:RAX?

  2. Why does RDX always return zero, and what is it supposed to return?

Update:

If I change the code to this:

cpuid
rdtsc
mov [start_time],rax
addsd xmm14,xmm1 ; INSTRUCTION
rdtscp
mov [end_time],rax
cpuid
mov rax,[end_time]
mov rdx,[start_time]
sub rax,rdx

I get 64 in rax, but that sounds like too many cycles.

Recommended answer

Your first code (leading to the title question) is buggy because it overwrites the rdtsc and rdtscp results with the cpuid results in EAX,EBX,ECX and EDX.

Use lfence instead of cpuid; on Intel since forever and AMD with Spectre mitigation enabled, lfence will serialize the instruction stream and thus do what you want with rdtsc.
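
For example, a serialized measurement could look something like the sketch below. This is only a minimal sketch of my own (not code from the question or from Paoloni's paper): it assumes a spare qword location start_time, uses lfence as just described, and combines EDX:EAX into one 64-bit timestamp by shifting, which also covers question 1.

lfence                    ; wait for earlier instructions to finish executing
rdtsc                     ; TSC -> EDX:EAX (high:low 32-bit halves)
shl rdx, 32
or  rdx, rax              ; full 64-bit start timestamp in RDX
mov [start_time], rdx     ; start_time: an assumed qword scratch variable
lfence                    ; keep the timed work from starting before rdtsc

addsd xmm14, xmm1         ; instruction under test

lfence                    ; wait for the timed work to complete
rdtsc
shl rdx, 32
or  rdx, rax
sub rdx, [start_time]     ; RDX = elapsed reference cycles, overhead included

Even measured this way, the interval still contains the cost of the timing scaffolding itself, which is why the loop + perf stat approach further down is usually more useful for single instructions.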

Remember that RDTSC counts reference cycles, not core clock cycles. See Get CPU cycle count? for that and for more about RDTSC.

You don't have cpuid or lfence inside your measurement interval. But you do have rdtscp itself in the measurement interval. Back-to-back rdtscp is not fast; 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. Idle clock speed is usually a lot slower than the reference frequency; 1 reference cycle is equal or close to the "sticker" frequency, i.e. the max non-turbo sustained frequency on Intel CPUs, e.g. 4008 MHz on a "4GHz" Skylake CPU.

What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

You can try to subtract measurement overhead, though. See e.g. Clflush to invalidate cache line via C function, and the followups Using time stamp counter and clock_gettime for cache miss and Memory latency measurement with time stamp counter.
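
As a rough sketch of my own (same assumptions as the earlier snippet): time an empty interval with exactly the same fencing and TSC scaffolding, repeat it a few times, and subtract the smallest result you see from your real measurements.

lfence
rdtsc
shl rdx, 32
or  rdx, rax
mov rsi, rdx              ; start of an interval with nothing in it
lfence
                          ; (nothing being timed here)
lfence
rdtsc
shl rdx, 32
or  rdx, rax
sub rdx, rsi              ; RDX ~= overhead of the timing scaffolding alone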

This is what I usually use to profile the latency or throughput (and uop counts in the fused and unfused domains) of an instruction or short block. Adjust how you use it to bottleneck on latency, as here, or not if you just want to test throughput: e.g. use a %rep block with enough different registers to hide latency, or break dependency chains with a pxor xmm3, xmm3 after a short block and let out-of-order exec work its magic. (As long as you don't bottleneck on the front-end.)
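
As an illustration of that throughput variant (my own adaptation, not part of the original example below): give each addsd its own accumulator so the dependency chains are independent. With 4 cycle latency and 0.5c throughput you need about latency / throughput = 8 chains in flight to saturate the FP add ports.

.loop:                    ; same skeleton as the full example below
    addsd  xmm4, xmm3     ; 8 independent dependency chains:
    addsd  xmm5, xmm3     ; each accumulator depends only on itself,
    addsd  xmm6, xmm3     ; so out-of-order exec can overlap them
    addsd  xmm7, xmm3
    addsd  xmm8, xmm3
    addsd  xmm9, xmm3
    addsd  xmm10, xmm3
    addsd  xmm11, xmm3
    dec    ecx
    jnz    .loop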

You might want to use NASM's smartalign package, or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.
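
For example (my own sketch; the particular mode and threshold are just plausible values, see the NASM manual for ALIGNMODE's options), near the top of the file:

%use smartalign           ; enable NASM's standard smartalign macro package
ALIGNMODE p6, 32          ; pad ALIGN with long NOPs instead of walls of 0x90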

global _start
_start:
    mov   ecx, 1000000000
; linux static executables start with XMM0..15 already zeroed
align 32                     ; just for good measure to avoid uop-cache effects
.loop:
    ;; LOOP BODY, put whatever you want to time in here
    times 4   addsd  xmm4, xmm3

    dec   ecx
    jnz   .loop

    mov  eax, 231
    xor  edi, edi
    syscall          ; x86-64 Linux sys_exit_group(0)

Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

(I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
 taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"

Result from i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column. It's fixed upstream but Arch Linux hasn't updated yet.):

 Performance counter stats for './testloop' (4 runs):

          4,106.09 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                17      context-switches          #    4.080 M/sec                    ( +-  5.65% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.487 M/sec                  
    16,012,778,144      cycles                    # 3900323.504 GHz                   ( +-  0.01% )
     1,001,537,894      branches                  # 243950284.862 M/sec               ( +-  0.00% )
     6,008,071,198      instructions              #    0.38  insn per cycle           ( +-  0.00% )
     5,013,366,769      uops_issued.any           # 1221134275.667 M/sec              ( +-  0.01% )
     5,013,217,655      uops_executed.thread      # 1221097955.182 M/sec              ( +-  0.01% )

          4.106283 +- 0.000536 seconds time elapsed  ( +-  0.01% )

On my i7-6700k (Skylake), addsd has 4 cycle latency, 0.5c throughput. (i.e. 2 per clock, if latency wasn't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

16,012,778,144 cycles / 1,001,537,894 branches ≈ 16 cycles per loop iteration = 16 cycles per chain of 4 dependent addsd instructions = 4 cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100, even though this test includes a tiny bit of startup overhead and interrupt overhead.

Take your pick of different counters to record. Adding a :u suffix, like instructions:u, to a perf event will count only user-space instructions, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for wall-clock time. But if you do, cycles:u can match very closely with instructions:u.

-r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

Adjust your initial ECX value to make the total time about 0.1 to 1 second; that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference), or if it runs at max non-turbo with turbo disabled.

But this counts in core clock cycles, not reference cycles, so it still gives the same result regardless of CPU frequency changes. (+- some noise from stopping the clock during the transition.)
