Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

Problem description

I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for the BigInteger.
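For reference, a minimal sketch of what such a layout could look like (the field names FData and FSize are illustrative; the question does not show the actual declaration):

type
  TLimb = Cardinal;              // 32-bit unsigned limb
  PLimb = ^TLimb;

  BigInteger = record
  private
    FData: array of TLimb;       // limbs, least significant first
    FSize: Integer;              // limb count; the top bit doubles as the sign bit
  end;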

To add two BigIntegers, I create a new BigInteger of the appropriate size and then, after some bookkeeping, call the following procedure, passing it three pointers to the respective starts of the arrays for the left and right operand and the result, as well as the numbers of limbs for left and right, respectively.

Plain code:

class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer); 
asm
// EAX = Left, EDX = Right, ECX = Result
        PUSH    ESI
        PUSH    EDI
        PUSH    EBX
        MOV     ESI,EAX                 // Left
        MOV     EDI,EDX                 // Right
        MOV     EBX,ECX                 // Result
        MOV     ECX,RSize               // Number of limbs at Right
        MOV     EDX,LSize               // Number of limbs at Left
        CMP     EDX,ECX
        JAE     @SkipSwap
        XCHG    ECX,EDX                 // Left and LSize should be largest
        XCHG    ESI,EDI                 // so swap
@SkipSwap:
        SUB     EDX,ECX                 // EDX contains rest
        PUSH    EDX                     // ECX contains smaller size
        XOR     EDX,EDX                 // Zero the index; also clears CF for the first ADC
@MainLoop:
        MOV     EAX,[ESI + CLimbSize*EDX]  // CLimbSize = SizeOf(TLimb) = 4.
        ADC     EAX,[EDI + CLimbSize*EDX]
        MOV     [EBX + CLimbSize*EDX],EAX
        INC     EDX
        DEC     ECX
        JNE     @MainLoop
        POP     EDI                        // EDI = rest count (LSize - RSize), pushed earlier
        INC     EDI                        // Do not change Carry Flag
        DEC     EDI                        // INC/DEC pair tests EDI for zero without touching CF
        JE      @LastLimb
@RestLoop:
        MOV     EAX,[ESI + CLimbSize*EDX]
        ADC     EAX,ECX                    // ECX is zero here, so this just adds the carry
        MOV     [EBX + CLimbSize*EDX],EAX
        INC     EDX
        DEC     EDI
        JNE     @RestLoop
@LastLimb:
        ADC     ECX,ECX                    // Add in final carry
        MOV     [EBX + CLimbSize*EDX],ECX
@Exit:
        POP     EBX
        POP     EDI
        POP     ESI
end;
// RET is inserted by Delphi compiler.

This code worked well, and I was pretty satisfied with it, until I noticed that, on my development setup (Win7 in a Parallels VM on an iMac), a simple PURE PASCAL addition routine, doing the same while emulating the carry with a variable and a few if clauses, was faster than my plain, straightforward handcrafted assembler routine.
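Such a pure Pascal routine might look roughly like this (a sketch based on the description above, not the author's actual code; it assumes {$POINTERMATH ON} so a PLimb can be indexed, and equal-length operands for brevity):

procedure PurePascalAdd(Left, Right, Result: PLimb; Count: Integer);
var
  I: Integer;
  Sum, Carry: TLimb;
begin
  Carry := 0;
  for I := 0 to Count - 1 do
  begin
    Sum := Left[I] + Right[I] + Carry;   // wraps modulo 2^32
    // The addition overflowed iff the wrapped sum is smaller than one
    // operand (or equal to it, when a carry came in).
    if (Sum < Left[I]) or ((Carry = 1) and (Sum = Left[I])) then
      Carry := 1
    else
      Carry := 0;
    Result[I] := Sum;
  end;
  Result[Count] := Carry;                // store the final carry limb
end;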

It took me a while to find out that on certain CPUs (including my iMac and an older laptop), the combination of DEC or INC and ADC or SBB could be extremely slow. But on most of my others (I have five other PCs to test it on, although four of these are exactly the same), it was quite fast.

So I wrote a new version, emulating INC and DEC using LEA and JECXZ instead, like so:

Partially emulated code:

@MainLoop:
        MOV     EAX,[ESI + EDX*CLimbSize]
        LEA     ECX,[ECX - 1]                   // Avoid INC and DEC, see above.
        ADC     EAX,[EDI + EDX*CLimbSize]
        MOV     [EBX + EDX*CLimbSize],EAX
        LEA     EDX,[EDX + 1]
        JECXZ   @DoRestLoop                     // LEA does not modify Zero flag, so JECXZ is used.
        JMP     @MainLoop
@DoRestLoop:
// similar code for the rest loop 

That made my code on the "slow" machines almost three times as fast, but some 20% slower on the "faster" machines. So now, as initialization code, I do a simple timing loop and use that to decide whether to set up the unit to call the plain or the emulated routine(s). This is almost always correct, but sometimes it chooses the (slower) plain routines when it should have chosen the emulating routines.
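The dispatch itself can be as simple as a procedure variable assigned once at startup (a sketch with hypothetical names; TimeRoutine stands for a helper that runs the given routine many times over a scratch buffer and returns the elapsed ticks, e.g. via QueryPerformanceCounter):

var
  AddProc: procedure(Left, Right, Result: PLimb; LSize, RSize: Integer);

procedure InitAddDispatch;
begin
  // TimeRoutine is hypothetical: benchmark each candidate briefly,
  // then point AddProc at whichever was faster on this CPU.
  if TimeRoutine(@EmulatedAdd) < TimeRoutine(@PlainAdd) then
    AddProc := EmulatedAdd
  else
    AddProc := PlainAdd;
end;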

But I don't know if this is the best way to do this.

I gave my solution, but do the asm gurus here perhaps know a better way to avoid the slowness on certain CPUs?

Peter and Nils' answers helped me a lot to get on the right track. This is the main part of my final solution for the DEC version:

Plain code:

class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
        PUSH    ESI
        PUSH    EDI
        PUSH    EBX
        MOV     ESI,EAX                         // Left
        MOV     EDI,EDX                         // Right
        MOV     EBX,ECX                         // Result
        MOV     ECX,RSize
        MOV     EDX,LSize
        CMP     EDX,ECX
        JAE     @SkipSwap
        XCHG    ECX,EDX
        XCHG    ESI,EDI
@SkipSwap:
        SUB     EDX,ECX
        PUSH    EDX
        XOR     EDX,EDX
        XOR     EAX,EAX
        MOV     EDX,ECX
        AND     EDX,$00000003
        SHR     ECX,2
        CLC
        JE      @MainTail
@MainLoop:
        // Unrolled 4 times. More times will not improve speed anymore.
        MOV     EAX,[ESI]
        ADC     EAX,[EDI]
        MOV     [EBX],EAX
        MOV     EAX,[ESI + CLimbSize]
        ADC     EAX,[EDI + CLimbSize]
        MOV     [EBX + CLimbSize],EAX
        MOV     EAX,[ESI + 2*CLimbSize]
        ADC     EAX,[EDI + 2*CLimbSize]
        MOV     [EBX + 2*CLimbSize],EAX
        MOV     EAX,[ESI + 3*CLimbSize]
        ADC     EAX,[EDI + 3*CLimbSize]
        MOV     [EBX + 3*CLimbSize],EAX
        // Update pointers.
        LEA     ESI,[ESI + 4*CLimbSize]
        LEA     EDI,[EDI + 4*CLimbSize]
        LEA     EBX,[EBX + 4*CLimbSize]
        // Update counter and loop if required.
        DEC     ECX                             
        JNE     @MainLoop
@MainTail:
        // Add index*CLimbSize so @MainX branches can fall through.
        LEA     ESI,[ESI + EDX*CLimbSize]
        LEA     EDI,[EDI + EDX*CLimbSize]
        LEA     EBX,[EBX + EDX*CLimbSize]
        // Indexed jump.
        LEA     ECX,[@JumpsMain]
        JMP     [ECX + EDX*TYPE Pointer]
        // Align jump table manually, with NOPs. Update if necessary.
        NOP
// Jump table.
@JumpsMain:
        DD      @DoRestLoop
        DD      @Main1
        DD      @Main2
        DD      @Main3
@Main3:
        MOV     EAX,[ESI - 3*CLimbSize]
        ADC     EAX,[EDI - 3*CLimbSize]
        MOV     [EBX - 3*CLimbSize],EAX
@Main2:
        MOV     EAX,[ESI - 2*CLimbSize]
        ADC     EAX,[EDI - 2*CLimbSize]
        MOV     [EBX - 2*CLimbSize],EAX
@Main1:
        MOV     EAX,[ESI - CLimbSize]
        ADC     EAX,[EDI - CLimbSize]
        MOV     [EBX - CLimbSize],EAX
@DoRestLoop:

// etc...    

I removed a lot of white space, and I guess the reader can get the rest of the routine. It is similar to the main loop. A speed improvement of approx. 20% for larger BigIntegers, and some 10% for small ones (only a few limbs).

The 64 bit version now uses 64 bit addition where possible (in the main loop and in Main3 and Main2, which are not "fall-through" like above). Before, 64 bit was quite a lot slower than 32 bit, but now it is 30% faster than 32 bit and twice as fast as the original simple 64 bit loop.

Intel proposes the following in its Intel 64 and IA-32 Architectures Optimization Reference Manual, 3.5.2.6 Partial Flag Register Stalls -- Example 3-29:

        XOR     EAX,EAX

        .ALIGN  16

@MainLoop:

        ADD     EAX,[ESI]       // Sets all flags, so no partial flag register stall
        ADC     EAX,[EDI]       // ADD added in previous carry, so its result might have carry
        MOV     [EBX],EAX
        MOV     EAX,[ESI + CLimbSize]
        ADC     EAX,[EDI + CLimbSize]
        MOV     [EBX + CLimbSize],EAX
        MOV     EAX,[ESI + 2*CLimbSize]
        ADC     EAX,[EDI + 2*CLimbSize]
        MOV     [EBX + 2*CLimbSize],EAX
        MOV     EAX,[ESI + 3*CLimbSize]
        ADC     EAX,[EDI + 3*CLimbSize]
        MOV     [EBX + 3*CLimbSize],EAX
        SETC    AL              // Save carry for next iteration
        MOVZX   EAX,AL
        ADD     ESI,CUnrollIncrement*CLimbSize  // LEA has slightly worse latency
        ADD     EDI,CUnrollIncrement*CLimbSize
        ADD     EBX,CUnrollIncrement*CLimbSize
        DEC     ECX
        JNZ     @MainLoop

The carry flag is saved in AL with SETC, and zero-extended into EAX with MOVZX. It is added back in by the first ADD of the loop. The next instruction must then be an ADC, because that ADD may itself generate a carry. Also see the comments in the code.

Because the carry is saved in EAX, I can also use ADD to update the pointers. The first ADD in the loop also updates all flags, so ADC won't suffer from a partial flag register stall.

Recommended answer

What you're seeing on old P6-family CPUs is a partial-flag stall. Early Sandybridge-family handles merging more efficiently, and later SnB-family (e.g. Skylake) has no merging cost at all: uops that need both CF and some flags from the SPAZO group read them as 2 separate inputs.

Intel CPUs (other than P4) rename each flag bit separately, so JNE only depends on the last instruction that set all the flags it uses (in this case, just the Z flag). In fact, recent Intel CPUs can even internally combine an inc/jne into a single inc-and-branch uop (macro-fusion). However, the trouble comes when reading a flag bit that was left unmodified by the last instruction that updated any flags.

Agner Fog says Intel CPUs (even PPro / PII) don't stall on inc / jnz. It's not actually the inc/jnz that's stalling, it's the adc in the next iteration that has to read the CF flag after inc wrote other flags but left CF unmodified.

; Example 5.21. Partial flags stall when reading unmodified flag bits
cmp eax, ebx
inc ecx
jc xx
; Partial flags stall  (P6 / PIII / PM / Core2 / Nehalem)

Agner Fog also says more generally: "Avoid code that relies on the fact that INC or DEC leaves the carry flag unchanged." (for Pentium M/Core2/Nehalem). The suggestion to avoid inc/dec entirely is obsolete, and only applied to P4. Other CPUs rename different parts of EFLAGS separately, and only have trouble when merging is required (reading a flag that was unmodified by the last insn to write any flags).

On the machines where it's fast (Sandybridge and later), they're inserting an extra uop to merge the flags register when you read bits that weren't written by the last instruction that modified it. This is much faster than stalling for 7 cycles, but still not ideal.

P4 always tracks whole registers rather than renaming partial registers, and it doesn't even rename EFLAGS. So inc/jz has a "false" dependency on whatever wrote the flags before it. This means that the loop condition can't detect the end of the loop until execution of the adc dep chain gets there, so the branch mispredict that can happen when the loop-branch stops being taken can't be detected early. It does prevent any partial-flags stalls, though.

Your lea / jecxz avoids the problem nicely. It's slower on SnB and later because you didn't unroll your loop at all. Your LEA version is 11 uops (can issue one iteration per 3 cycles), while the inc version is 7 uops (can issue one iter per 2 cycles), not counting the flag-merging uop it inserts instead of stalling.

If the loop instruction wasn't slow, it would be perfect for this. It's actually fast on AMD Bulldozer-family (1 m-op, same cost as a fused compare-and-branch), and Via Nano3000. It's bad on all Intel CPUs, though (7 uops on SnB-family).

When you unroll, you can get another small gain from using pointers instead of indexed addressing modes, because 2-reg addressing modes can't micro-fuse on SnB and later. A group of load/adc/store instructions is 6 uops without micro-fusion, but only 4 with micro-fusion. CPUs can issue 4 fused-domain uops/clock. (See Agner Fog's CPU microarch doc, and instruction tables, for details on this level.)
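As a rough illustration of that claim, using the loop body from above (the 6-vs-4 totals are the ones given in the answer; exact uop counts vary by microarchitecture):

; Two-register (indexed) addressing: the memory operands can't stay
; micro-fused on SnB-family, so one limb step costs 6 fused-domain uops:
        MOV     EAX,[ESI + EDX*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize]
        MOV     [EBX + EDX*CLimbSize],EAX
; One-register addressing keeps the memory operands micro-fused, so the
; same step is only 4 fused-domain uops (pointers updated separately):
        MOV     EAX,[ESI]
        ADC     EAX,[EDI]
        MOV     [EBX],EAX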

Save uops when you can, to make sure the CPU can issue instructions faster than it executes them, so that it can see far enough ahead in the instruction stream to absorb any bubbles in insn fetch (e.g. branch mispredicts). Fitting in the 28-uop loop buffer also means power savings (and on Nehalem, avoiding instruction-decoding bottlenecks). There are things like instruction alignment and crossing uop cache-line boundaries that make it hard to sustain a full 4 uops/clock without the loop buffer, too.

Another trick is to keep pointers to the end of your buffers, and count up towards zero. (So at the start of your loop, you get the first item as end[-idx].)

        ; pure loads are always one uop, so we can still index it
        ; with no perf hit on SnB
        add     esi, ecx   ; point to end of src1
        neg     ecx

UNROLL equ 4
@MainLoop:
        MOV     EAX, [ESI + 0*CLimbSize + ECX*CLimbSize]
        ADC     EAX, [EDI + 0*CLimbSize]
        MOV     [EBX + 0*CLimbSize], EAX

        MOV     EAX, [ESI + 1*CLimbSize + ECX*CLimbSize]
        ADC     EAX, [EDI + 1*CLimbSize]
        MOV     [EBX + 1*CLimbSize], EAX

        ; ... repeated UNROLL times.  Use an assembler macro to repeat these 3 instructions with increasing offsets

        LEA     ECX, [ECX+UNROLL]            ; loop counter

        LEA     EDI, [EDI+CLimbSize*UNROLL]  ; Unrolling makes it worth doing
        LEA     EBX, [EBX+CLimbSize*UNROLL]  ; a separate increment to save a uop for every ADC and store on SnB & later.

        JECXZ   @DoRestLoop                  ; LEA does not modify the Zero flag, so JECXZ is used.
        JMP     @MainLoop
@DoRestLoop:

An unroll of 4 should be good. No need to overdo it, since you're probably going to be able to saturate the load/store ports of pre-Haswell CPUs with an unroll of just 3 or 4, maybe even 2.

An unroll of 2 will make the above loop exactly 14 fused-domain uops for Intel CPUs. adc is 2 ALU (+1 fused memory), jecxz is 2, the rest (including LEA) are all 1. In the unfused domain, 10 ALU/branch and 6 memory (well, 8 memory if you really count store-address and store-data separately).

  • 14 fused-domain uops per iteration: issues one iteration per 4 clocks. (The odd 2 uops at the end have to issue as a group of 2, even from the loop buffer.)
  • 10 ALU & branch uops: takes 3.33c to execute them all on pre-Haswell. I don't think any one port will be a bottleneck, either: adc's uops can run on any port, and lea can run on p0/p1. The jumps use port5 (and jecxz also uses one of p0/p1).
  • 6 memory operations: takes 3c to execute on pre-Haswell CPUs, which can handle 2 per clock. Haswell added a dedicated AGU for stores, so it can sustain 2 loads + 1 store per clock.

So for pre-Haswell CPUs, using LEA/JECXZ, an unroll of 2 won't quite saturate either the ALU or the load/store ports. An unroll of 4 will bring it up to 22 fused uops (6 cycles to issue), 14 ALU & branch uops (4.66c to execute), and 12 memory operations (6 cycles to execute). So an unroll of 4 will saturate pre-Haswell CPUs, but only just barely. The CPU won't have any buffer of instructions to churn through on a branch mispredict.

Haswell and later will always be bottlenecked on the frontend (4 uops per clock limit), because the load/adc/store combo takes 4 uops, and can be sustained at one per clock. So there's never any "room" for loop overhead without cutting into adc throughput. This is where you have to know not to overdo it and unroll too much.

On Broadwell / Skylake, adc is only a single uop with 1c latency, and load / adc r, m / store appears to be the best sequence. adc m, r/i is 4 uops. This should sustain one adc per clock, like AMD.

On AMD CPUs, adc is only one macro-op, so if the CPU can sustain an issue rate of 4 (i.e. no decoding bottlenecks), then they can also use their 2 load / 1 store port to beat Haswell. Also, jecxz on AMD is as efficient as any other branch: only one macro-op. Multi-precision math is one of the few things AMD CPUs are good at. Lower latencies on some integer instructions give them an advantage in some GMP routines.

An unroll of more than 5 might hurt performance on Nehalem, because that would make the loop bigger than the 28-uop loop buffer. Instruction decoding would then limit you to less than 4 uops per clock. On even earlier (Core2), there's a 64B x86-instruction loop buffer (64B of x86 code, not uops), which helps some with decode.

Unless this adc routine is the only bottleneck in your app, I'd keep the unroll factor down to maybe 2. Or maybe even not unroll, if that saves a lot of prologue / epilogue code, and your BigInts aren't too big. You don't want to bloat the code too much and create cache misses when callers call lots of different BigInteger functions, like add, sub, mul, and do other things in between. Unrolling too much to win at microbenchmarks can shoot yourself in the foot if your program doesn't spend a long time in your inner loop on each call.

If your BigInt values aren't usually gigantic, then it's not just the loop you have to tune. A smaller unroll could be good to simplify the prologue/epilogue logic. Make sure you check lengths so ECX doesn't cross zero without ever being zero, of course. This is the trouble with unrolling and vectors. :/

This might be the most efficient way:

lahf
# clobber flags
sahf              ; cheap on AMD and Intel.  This doesn't restore OF, but we only care about CF

# or

setc al
# clobber flags
add  al, 255      ; generate a carry if al is non-zero

Using the same register as the adc dep chain isn't actually a problem: eax will always be ready at the same time as the CF output from the last adc. (On AMD and P4/Silvermont partial-reg writes have a false dep on the full reg. They don't rename partial regs separately). The save/restore is part of the adc dep chain, not the loop condition dep chain.

The loop condition only checks flags written by cmp, sub, or dec. Saving/restoring flags around it doesn't make it part of the adc dep chain, so the branch mispredict at the end of the loop can be detected before adc execution gets there. (A previous version of this answer got this wrong.)

There's almost certainly some room to shave off instructions in the setup code, maybe by using registers where values start. You don't have to use edi and esi for pointers, although I know it makes initial development easier when you're using registers in ways consistent with their "traditional" use. (e.g. destination pointer in EDI).

Does Delphi let you use ebp? It's nice to have a 7th register.

Obviously 64bit code would make your BigInt code run about twice as fast, even though you'd have to worry about doing a single 32b adc at the end of a loop of 64bit adc. It would also give you 2x the amount of registers.
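A hedged sketch of that tail handling in x64 (illustrative only, not the author's code; nothing that writes flags may come between the last 64-bit ADC and the 32-bit one):

; ... the 64-bit ADC loop over pairs of limbs falls through to here ...
@OddLimb:
        MOV     EAX,[RSI + RDX*8]   ; load the single leftover 32-bit limb
        ADC     EAX,[RDI + RDX*8]   ; CF still holds the carry from the last 64-bit ADC
        MOV     [RBX + RDX*8],EAX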
