Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?


Problem Description


LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference is that dec/jnz sets flags.

loop on various microarchitectures, from Agner Fog's instruction tables:

  • K8/K10: 7 m-ops

  • Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)

  • P4: 4 uops (same as jecxz)

  • P6 (PII/PIII): 8 uops

  • Pentium M, Core2: 11 uops

  • Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).

  • SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with the same throughput as regular jcc.

  • Silvermont: 7 uops

  • AMD Jaguar (low-power): 8 uops, 5c throughput

  • Via Nano3000: 2 uops


Couldn't the decoders just decode the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx and truncate RIP to EIP if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour with using ecx with 32-bit address-size.)
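For illustration, here is roughly what that proposed decoding looks like when written out as real instructions (a sketch only, not how any actual decoder works). At the assembly level the branch sense of jrcxz is the opposite of loop's, so an extra jmp is needed; an internal uop could simply branch on the inverted condition:

    ; flag-preserving emulation of "loop .top"
    lea   rcx, [rcx - 1]    ; decrement the count without writing flags
    jrcxz .fallthrough      ; leave the loop once the count reaches zero
    jmp   .top              ; otherwise keep looping
.fallthrough: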

Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).

I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.


  • Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

  • What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.


If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
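To make that concrete, here is a minimal sketch of such an adc loop (the register assignments are mine, purely for illustration: rsi = source limbs, rdi = destination limbs, rcx = limb count). Since neither lea nor loop writes any flags, the carry produced by adc survives untouched into the next iteration:

    clc                      ; start with carry clear
.top:
    mov   rax, [rsi]
    adc   [rdi], rax         ; dest limb += src limb + CF, produces the next CF
    lea   rsi, [rsi + 8]     ; pointer increments via lea: no flags touched
    lea   rdi, [rdi + 8]
    loop  .top               ; decrement rcx and branch without writing flags

With dec rcx / jnz in place of loop the loop still works (dec leaves CF alone), but having adc read CF after a flag-writing dec is exactly what triggers the partial-flag stall on P6-family.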

On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).

If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
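The code-size difference is small but real. Assuming 64-bit mode, the standard encodings compare like this (in 32-bit mode dec ecx is a single byte, so the gap shrinks to one byte):

    dec   rcx        ; 48 FF C9 -> 3 bytes
    jnz   .top       ; 75 rel8  -> 2 bytes   (5 bytes total, and writes flags)

    loop  .top       ; E2 rel8  -> 2 bytes   (flags untouched)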


It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
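For reference, the pattern being complained about looks something like this (a made-up 16-bit example): the body needs cx as well, so the counter has to be spilled around it, and both branches still pay for the slow loop:

    mov   cx, 8          ; outer iteration count
outer:
    push  cx             ; spill the outer counter...
    mov   cx, 4          ; ...because the body wants cx too
inner:
    ; ... body work ...
    loop  inner          ; slow loop for the inner count
    pop   cx
    loop  outer          ; and again for the outer one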

Solution

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.

His answer made sense. It had to do with paging.

LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.
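Spelled out as separate instructions (ignoring that dec, unlike LOOP, writes flags), the two halves look like this, and only the second half can touch an address that faults:

    dec   cx             ; part 1: pure register arithmetic, cannot fault
    jnz   target         ; part 2: a taken jump whose destination can lie outside
                         ;         the segment limit or on a not-present page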

A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.

Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:

The protection violations involved usually indicate a probable software bug and restart is not desired if one of these violations occurs. In a Protected Mode 80286 system with wait states during any bus cycles, when certain protection violations are detected by the 80286 component, and the component transfers control to the exception handling routine, the contents of the CX register may be unreliable. (Whether CX contents are changed is a function of bus activity at the time internal microcode detects the protection violation.)

To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.

It's this extra burden of saving CX that made LOOP so slow.

Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.
