How to calculate time for an asm delay loop on x86 linux?


Question


I was going through this link, delay in assembly, to add a delay in assembly. I want to perform some experiments by adding different delay values.

The code used there to generate the delay:

; start delay

mov bp, 43690
mov si, 43690
delay2:
dec bp
nop
jnz delay2
dec si
cmp si,0    
jnz delay2
; end delay

What I understood from the code is that the delay is proportional to the time it takes to execute the nop instruction (43690×43690 times). So the delay will be different on different systems and different OS versions. Am I right?

Can anyone explain to me how to calculate the amount of delay, in nanoseconds, that the following assembly code generates, so that I can conclude my experiment with respect to the delay I added in my experimental setup?

This is the code I am using to generate the delay, without understanding the logic behind the use of the value 43690 (I used only one loop against the two loops in the original source code). To generate a different delay (without knowing its value), I just varied the number 43690 to 403690 or some other value.

Code on a 32-bit OS:

movl  $43690, %esi   # ---> if I vary this 4003690 then delay value ??
.delay2:
    dec %esi
    nop
    jnz .delay2

How much delay is generated by this assembly code?

If I want to generate a delay of 100 nsec or 1000 nsec, or any other delay in microseconds, what initial value do I need to load into the register?

I am using Ubuntu 16.04 (both 32-bit and 64-bit), on an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz and a Core i3-3470 CPU @ 3.20GHz.

Thank you in advance.

Solution

There is no very good way to get accurate and predictable timing from fixed counts for delay loops on a modern x86 PC, especially in user-space under a non-realtime OS like Linux. (But you could spin on rdtsc for very short delays; see below). You can use a simple delay-loop if you need to sleep at least long enough and it's ok to sleep longer when things go wrong.

Normally you want to sleep and let the OS wake your process, but this doesn't work for delays of only a couple microseconds on Linux. nanosleep can express it, but the kernel doesn't schedule with such precise timing. See How to make a thread sleep/block for nanoseconds (or at least milliseconds)?. On a kernel with Meltdown + Spectre mitigation enabled, a round-trip to the kernel takes longer than a microsecond anyway.
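For reference, a minimal user-space sketch of asking the OS for a short sleep (not from the original answer; expect the actual wakeup to be later than requested by at least a few microseconds on a normal desktop kernel):

#define _POSIX_C_SOURCE 199309L
#include <time.h>

// Ask the kernel for an ~ns-scale sleep.  On a typical desktop kernel the
// actual wakeup is several microseconds later than requested.
static void sleep_ns(long ns)
{
    struct timespec req = { .tv_sec = ns / 1000000000L, .tv_nsec = ns % 1000000000L };
    nanosleep(&req, NULL);   // ignoring EINTR / remaining time for brevity
}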

(Or are you doing this inside the kernel? I think Linux already has a calibrated delay loop. In any case, it has a standard API for delays: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt, including ndelay(unsigned long nsecs) which uses the "jiffies" clock-speed estimate to sleep for at least long enough. IDK how accurate that is, or if it sometimes sleeps much longer than needed when clock speed is low, or if it updates the calibration as the CPU freq changes.)
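If you are inside the kernel, a quick sketch of that API (from <linux/delay.h>; which call is appropriate depends on whether your context can sleep, per the howto above):

#include <linux/delay.h>

/* Sketch of the kernel's standard delay API (see timers-howto.txt). */
static void example_delays(void)
{
    ndelay(100);            /* busy-wait ~100 ns; usable in atomic context */
    udelay(10);             /* busy-wait ~10 us */
    usleep_range(50, 100);  /* process context only: sleep 50-100 us */
}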


Your (inner) loop is totally predictable at 1 iteration per core clock cycle on recent Intel/AMD CPUs, whether or not there's a nop in it. It's under 4 fused-domain uops, so you bottleneck on the 1-per-clock loop throughput of your CPUs. (See Agner Fog's x86 microarch guide, or time it yourself for large iteration counts with perf stat ./a.out.) Unless there's competition from another hyperthread on the same physical core...

Or unless the inner loop spans a 32-byte boundary, on Skylake or Kaby Lake (loop buffer disabled by microcode updates to work around a design bug). Then your dec / jnz loop could run at 1 per 2 cycles because it would require fetching from 2 different uop-cache lines.

I'd recommend leaving out the nop to have a better chance of it being 1 per clock on more CPUs, too. You need to calibrate it anyway, so a larger code footprint isn't helpful (so leave out extra alignment, too). (Make sure calibration happens while CPU is at max turbo, if you need to ensure a minimum delay time.)
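A sketch of what that calibration could look like from user space (assumptions: GNU C inline asm, the CPU is already at max turbo, and CLOCK_MONOTONIC over a large iteration count is accurate enough):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

// The dec/jnz inner loop from the question, without the nop.
static void spin(uint32_t iters)
{
    __asm__ volatile("1: dec %0 \n\t jnz 1b" : "+r"(iters));
}

int main(void)
{
    const uint32_t iters = 1000000000;   // large count so timer overhead is negligible
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    spin(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.3f ns per iteration\n", ns / iters);   // should be about one core clock period
    return 0;
}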

If your inner loop wasn't quite so small (e.g. more nops), see Is performance reduced when executing loops whose uop count is not a multiple of processor width? for details on front-end throughput when the uop count isn't a multiple of 8. SKL / KBL with disabled loop buffers run from the uop cache even for tiny loops.


But x86 doesn't have a fixed clock frequency (and transitions between frequency states stop the clock for ~20k clock cycles (8.5us), on a Skylake CPU).

If running this with interrupts enabled, then interrupts are another unpredictable source of delays. (Even in kernel mode, Linux usually has interrupts enabled. An interrupts-disabled delay loop for tens of thousands of clock cycles seems like a bad idea.)

If running in user-space, then I hope you're using a kernel compiled with realtime support. But even then, Linux isn't fully designed for hard-realtime operation, so I'm not sure how good you can get.

System management mode interrupts are another source of delay that even the kernel doesn't know about. PERFORMANCE IMPLICATIONS OF SYSTEM MANAGEMENT MODE from 2013 says that 150 microseconds is considered an "acceptable" latency for an SMI, according to Intel's test suite for PC BIOSes. Modern PCs are full of voodoo. I think/hope that the firmware on most motherboards doesn't have much SMM overhead, and that SMIs are very rare in normal operation, but I'm not sure. See also Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine

Extremely low-power Skylake CPUs stop their clock with some duty-cycle, instead of clocking lower and running continuously. See this, and also Intel's IDF2015 presentation about Skylake power management.


Spin on RDTSC until the right wall-clock time

If you really need to busy-wait, spin on rdtsc waiting for the current time to reach a deadline. You need to know the reference frequency, which is not tied to the core clock, so it's fixed and nonstop (on modern CPUs; there are CPUID feature bits for invariant and nonstop TSC. Linux checks this, so you could look in /proc/cpuinfo for constant_tsc and nonstop_tsc, but really you should just check CPUID yourself on program startup and work out the RDTSC frequency (somehow...)).
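A sketch of the CPUID check from C (using GCC/Clang's <cpuid.h>; the invariant-TSC bit is CPUID.80000007H:EDX[8]. Actually determining the TSC frequency is messier and not shown here):

#include <cpuid.h>
#include <stdbool.h>

// True if the TSC runs at a constant rate and doesn't stop in deep sleep states.
static bool has_invariant_tsc(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;          // extended leaf not supported: assume no
    return (edx >> 8) & 1;     // EDX bit 8 = invariant TSC
}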

I wrote such a loop as part of a silly-computer-tricks exercise: a stopwatch in the fewest bytes of x86 machine code. Most of the code size is for the string manipulation to increment a 00:00:00 display and print it. I hard-coded the 4GHz RDTSC frequency for my CPU.

For sleeps of less than 2^32 reference clocks, you only need to look at the low 32 bits of the counter. If you do your compare correctly, wrap-around takes care of itself. For the 1-second stopwatch, a 4.3GHz CPU would have a problem, but for nsec / usec sleeps there's no issue.

 ;;; Untested,  NASM syntax

 default rel
 section .data
    ; RDTSC frequency in counts per 2^16 nanoseconds
    ; 3200000000 would be for a 3.2GHz CPU like your i3-3470

    ref_freq_fixedpoint: dd  3200000000 * (1<<16) / 1000000000

    ; The actual integer value is 0x033333
    ; which represents a fixed-point value of 3.1999969482421875 GHz
    ; use a different shift count if you like to get more fractional bits.
    ; I don't think you need 64-bit operand-size


 ; nanodelay(unsigned nanos /*edi*/)
 ; x86-64 System-V calling convention
 ; clobbers EAX, ECX, EDX, and EDI
 global nanodelay
 nanodelay:
      ; take the initial clock sample as early as possible.
      ; ideally even inline rdtsc into the caller so we don't wait for I$ miss.
      rdtsc                   ; edx:eax = current timestamp
      mov      ecx, eax       ; ecx = start
      ; lea ecx, [rax-30]    ; optionally bias the start time to account for overhead.  Maybe make this a variable stored with the frequency.

      ; then calculate edi = ref counts = nsec * ref_freq
      imul     edi, [ref_freq_fixedpoint]  ; counts * 2^16
      shr      edi, 16        ; actual counts, rounding down

.spinwait:                     ; do{
    pause         ; optional but recommended.
    rdtsc                      ;   edx:eax = reference cycles since boot
    sub      eax, ecx          ;   delta = now - start.  This may wrap, but the result is always a correct unsigned 0..n
    cmp      eax, edi          ; } while(delta < sleep_counts)
    jb     .spinwait

    ret

To avoid floating-point for the frequency calculation, I used fixed-point like uint32_t ref_freq_fixedpoint = 3.2 * (1<<16);. This means we just use an integer multiply and shift inside the delay loop. Use C code to set ref_freq_fixedpoint during startup with the right value for the CPU.
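For example, the startup code could be something like this sketch (assuming you've already determined the TSC frequency in Hz somehow, and that the NASM file exports the variable with global ref_freq_fixedpoint):

#include <stdint.h>

extern uint32_t ref_freq_fixedpoint;   // the dword the asm loop multiplies by

// counts per 2^16 nanoseconds = (tsc_hz / 1e9) * 2^16, done in integer math.
static void set_ref_freq(uint64_t tsc_hz)
{
    ref_freq_fixedpoint = (uint32_t)((tsc_hz << 16) / 1000000000u);
}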

If you recompile this for each target CPU, the multiply constant can be an immediate operand for imul instead of loading from memory.

pause sleeps for ~100 clocks on Skylake, but only for ~5 clocks on previous Intel uarches. So it hurts timing precision a bit, maybe sleeping up to 100 ns past a deadline when the CPU frequency is clocked down to ~1GHz. Or at a normal ~3GHz speed, more like up to +33ns.

Running continuously, this loop heated up one core of my Skylake i7-6700k at ~3.9GHz by ~15 degrees C without pause, but only by ~9 C with pause. (From a baseline of ~30C with a big CoolerMaster Gemini II heatpipe cooler, but low airflow in the case to keep fan noise low.)

Adjusting the start-time measurement to be earlier than it really is will let you compensate for some of the extra overhead, like branch-misprediction when leaving the loop, as well as the fact that the first rdtsc doesn't sample the clock until probably near the end of its execution. Out-of-order execution can let rdtsc run early; you might use lfence, or consider rdtscp, to stop the first clock sample from happening out-of-order ahead of instructions before the delay function is called.
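With compiler intrinsics, those fencing options look something like this (a sketch; treating lfence as an execution barrier before rdtsc assumes an Intel CPU, or an AMD CPU configured to make lfence dispatch-serializing):

#include <stdint.h>
#include <x86intrin.h>

// Option 1: LFENCE keeps RDTSC from executing until earlier instructions
// have completed locally.
static inline uint64_t tsc_after_lfence(void)
{
    _mm_lfence();
    return __rdtsc();
}

// Option 2: RDTSCP waits for earlier instructions before sampling, but
// doesn't stop later instructions from starting early.
static inline uint64_t tsc_rdtscp(void)
{
    unsigned aux;              // receives IA32_TSC_AUX; unused here
    return __rdtscp(&aux);
}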

Keeping the offset in a variable will let you calibrate the constant offset, too. If you can do this automatically at startup, that could be good to handle variations between CPUs. But you need some high-accuracy timer for that to work, and this is already based on rdtsc.

Inlining the first RDTSC into the caller and passing the low 32 bits as another function arg would make sure the "timer" starts right away even if there's an instruction-cache miss or other pipeline stall when calling the delay function. So the I$ miss time would be part of the delay interval, not extra overhead.


The advantage of spinning on rdtsc:

If anything happens that delays execution, the loop still exits at the deadline, unless execution is currently blocked when the deadline passes (in which case you're screwed with any method).

So instead of using exactly n cycles of CPU time, you use CPU time until the current time is n * freq nanoseconds later than when you first checked.
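In C, with the __rdtsc() intrinsic, the whole spin-wait boils down to an unsigned compare against the elapsed reference ticks, something like this sketch:

#include <stdint.h>
#include <x86intrin.h>

// Exit once the TSC has advanced at least `counts` reference ticks past `start`.
// The unsigned 32-bit subtraction gives a correct delta even across wrap-around.
static inline void spin_until(uint32_t start, uint32_t counts)
{
    while ((uint32_t)((uint32_t)__rdtsc() - start) < counts)
        _mm_pause();
}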

With a simple counter delay loop, a delay that's long enough at 4GHz would make you sleep more than 4x too long at 0.8GHz (typical minimum frequency on recent Intel CPUs).

This does run rdtsc twice, so it's not appropriate for delays of only a couple nanoseconds. (rdtsc itself is ~20 uops, and has a throughput of one per 25 clocks on Skylake/Kaby Lake.) I think this is probably the least bad solution for a busy-wait of hundreds or thousands of nanoseconds, though.

Downside: a migration to another core with an unsynced TSC could result in sleeping for the wrong time. But unless your delays are very long, the migration time will be longer than the intended delay. The worst case is sleeping for the delay-time again after the migration. The way I do the compare, (now - start) < count instead of waiting for a certain target count, means that unsigned wraparound makes the compare false (so the loop exits) when now - start becomes a huge number. You can't get stuck sleeping for nearly a whole second while the counter wraps around.

Downside: maybe you want to sleep for a certain number of core cycles, or to pause the count when the CPU is asleep.

Downside: old CPUs may not have a non-stop / invariant TSC. Check these CPUID feature bits at startup, and maybe use an alternate delay loop, or at least take it into account when calibrating. See also Get CPU cycle count? for my attempt at a canonical answer about RDTSC behaviour.


Future CPUs: use tpause on CPUs with the WAITPKG CPUID feature.

(I don't know which future CPUs are expected to have this.)

It's like pause, but puts the logical core to sleep until the TSC = the value you supply in EDX:EAX. So you could rdtsc to find out the current time, add / adc the sleep time scaled to TSC ticks to EDX:EAX, then run tpause.

Interestingly, it takes another input register where you can put a 0 for a deeper sleep (more friendly to the other hyperthread, probably drops back to single-thread mode), or 1 for faster wakeup and less power-saving.
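Putting those two paragraphs together with compiler intrinsics would look something like this sketch (needs -mwaitpkg and a CPU that actually has the feature; _tpause can also return early on an interrupt or the OS-imposed maximum wait):

#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>

// Sleep this logical core until the TSC reaches now + ticks (or an earlier wakeup).
// ctrl = 0 requests the deeper C0.2 state, ctrl = 1 the lighter C0.1 state.
static inline void tpause_for(uint64_t ticks, unsigned ctrl)
{
    uint64_t deadline = __rdtsc() + ticks;
    _tpause(ctrl, deadline);
}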

You wouldn't want to use this to sleep for seconds; you'd want to hand control back to the OS. But you could do an OS sleep to get close to your target wakeup if it's far away, then mov ecx,1 or xor ecx,ecx / tpause ecx for whatever time is left.

Semi-related (also part of the WAITPKG extension) are the even more fun umonitor / umwait, which (like privileged monitor/mwait) can have a core wake up when it sees a change to memory in an address range. For a timeout, it has the same wakeup on TSC = EDX:EAX as tpause.
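A sketch of the usual monitor / re-check / wait pattern with the umonitor / umwait intrinsics (same -mwaitpkg and CPU-support assumptions as above):

#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>

// Wait until someone stores to *flag, or until the TSC deadline, whichever is first.
static inline void wait_for_flag(volatile uint32_t *flag, uint64_t tsc_deadline)
{
    while (!*flag) {
        _umonitor((void *)flag);        // arm the monitor on this address range
        if (*flag)                      // re-check: the store may already have happened
            break;
        _umwait(0, tsc_deadline);       // sleep until store, interrupt, or deadline
    }
}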
