x86 assembly - optimization of clamping rax to [ 0 .. limit )

Question

I'm writing a simple assembler procedure, which, naturally, aims to be as quick as possible. However, a certain part, which is located in the most nested loop, doesn't seem 'right' and I believe it is possible to come up with cleverer and quicker implementation, maybe even without using conditional jumps. The code implements a simple thing:

if rax < 0 then
    rax := 0
else if rax >= r12 then
    rax := r12 - 1

And this is my naive approach:

cmp rax, 0
jge offsetXGE
   mov rax, 0
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   mov rax, r12
   dec rax
offsetXReady:

Any ideas are welcome, even those using MMX and some masking tricks.

Edit: To answer some questions in comments - yes, we can assume that r12 > 0, but rax can be negative.

Recommended answer

One compare-and-branch, and an LEA + cmov.


It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD to clamp values to a range like this.
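To illustrate the whole-vector case, here is a minimal C sketch. The name clamp_array is hypothetical and not from the answer. Each element gets a signed min followed by a signed max, which are the per-lane operations that PMINSD and PMAXSD perform on four 32-bit values at once. A compiler targeting SSE4.1 may vectorize a loop like this, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clamp every element of v[0..n) into [0, limit).
   Assumes limit > 0, as the question states for r12. */
static void clamp_array(int32_t *v, size_t n, int32_t limit)
{
    assert(limit > 0);
    const int32_t hi = limit - 1;
    for (size_t i = 0; i < n; i++) {
        int32_t x = v[i];
        x = x < hi ? x : hi;   /* signed min: what PMINSD does per lane */
        x = x > 0 ? x : 0;     /* signed max: what PMAXSD does per lane */
        v[i] = x;
    }
}
```

This only pays off when many values need clamping together; for a single value in rax the scalar code is the better fit, as the answer says.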

In your original, a couple things are clearly sub-optimal. The first two only matter for code-size most of the time, but LEA for a non-destructive add is a small but clear win:

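mov rax, 0 should be xor eax, eax. No, eax is not a typo for rax (writing a 32-bit register zero-extends into the full 64-bit register).

mov rax, r12 / dec rax should be lea rax, [r12 - 1].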

See the links in the x86 wiki, esp. Agner Fog's guides.

After searching around some, I found a similar question about optimal x86 asm for clamping to a range (http://codereview.stackexchange.com/questions/6502/fastest-way-to-clamp-an-integer-to-the-range-0-255). I got some inspiration from that, but mostly rewrote it with cmov instead of setcc/dec/and.

You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.

    ...
    cmp  rax, r12
    jae  .clamp      ; favour the fast-path more heavily by making it the not-taken case
.clamp_finished:     ; no regs are clobbered, but the clamp code reads r15, which must hold 0

    ...
    ret

.clamp:   
    ; flags still set from the cmp rax, r12
    ; we only get here if rax is >= r12 (`ge` signed compare), or negative (`l` rax < r12, signed)

    ; mov r15d, 0        ; or zero it outside the loop so it can be used when needed.  Can't xor-zero because we need to preserve flags

    lea    rax, [r12-1]  ; still doesn't modify flags
    cmovl  eax, r15d     ; rax=0 if  orig_rax<r12 (signed), which means we got here because orig_rax<0
    jmp  .clamp_finished

Quick perf analysis for Intel Haswell:


  • Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.

  • Clamping-needed case: one taken compare-and-branch uop, plus 4 more uops (lea, 2 for cmov, 1 more to jmp back). Latency for rax: 3 cycles from the later of rax and r12 (cmp->flags, flags->cmov).

Obviously you can use jb instead of jae to skip over the clamping lea/cmov, instead of pulling them out of the main flow. See the section below for motivation for that. (And/or see Anatolyg's excellent answer, which covers this. I got the cool trick of using jb to do the [0 .. limit) range check with one branch from Anatolyg's answer, too.)
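The single-branch range check can be modelled in C as a rough sketch. The function name clamp_one_branch is hypothetical, and limit > 0 is assumed, as the question confirms for r12. The unsigned comparison plays the role of the jae / jb test, and the signed comparison inside plays the role of the cmovl that picks which bound to use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the one-branch clamp. Assumes limit > 0. */
static int64_t clamp_one_branch(int64_t x, int64_t limit)
{
    assert(limit > 0);
    /* Unsigned compare: a negative x reinterpreted as unsigned is >= 2^63,
       which is larger than any positive limit, so this one test catches
       both x < 0 and x >= limit. */
    if ((uint64_t)x >= (uint64_t)limit) {
        /* Signed view of the same operands: inside this branch,
           x < limit can only mean x was negative. */
        x = (x < limit) ? 0 : limit - 1;
    }
    return x;
}
```

For example, clamp_one_branch(-1, 10) gives 0 and clamp_one_branch(10, 10) gives 9, while values already in range pass through without entering the branch.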

I think the version using cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency (except in the clamp-to-zero case with branches, see below).

An alternative branchy implementation of the .clamp code that doesn't need a zeroed-register would be:

.clamp:
    lea    rax, [r12-1]
    jge  .clamp_finished
    xor    eax, eax
    jmp  .clamp_finished

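It still computes a result that it might throw away, cmov-style. However, the following xor starts a fresh dependency chain, so it doesn't have to wait for lea to write rax.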

An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast-path through the code (as few instructions and as few taken-branches as possible). Depending on how infrequently branches are taken, it can be worth putting the code for the uncommon case off at the end of the function.

func:
    ...
    test
    jcc .unlikely
    ...        
.ret_from_unlikely:
    ...
    ... ;; lots of code
    ret

.unlikely:
    xor   eax,eax
    jmp .ret_from_unlikely   ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.

Gcc does this, I think when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.
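When writing C rather than asm, the same preference can be hinted to GCC or Clang with __builtin_expect. The sketch below is only an illustration: UNLIKELY and clamp_hinted are hypothetical names, limit > 0 is assumed, and whether the compiler actually moves the clamp code out of the fall-through path depends on the compiler and its options.

```c
#include <assert.h>
#include <stdint.h>

/* __builtin_expect is a GCC/Clang builtin; elsewhere the hint is dropped. */
#if defined(__GNUC__)
#define UNLIKELY(c) __builtin_expect(!!(c), 0)
#else
#define UNLIKELY(c) (c)
#endif

/* Hypothetical helper: clamp x into [0, limit), telling the compiler that
   clamping is the uncommon case. Assumes limit > 0. */
static int64_t clamp_hinted(int64_t x, int64_t limit)
{
    assert(limit > 0);
    if (UNLIKELY((uint64_t)x >= (uint64_t)limit)) {
        x = (x < 0) ? 0 : limit - 1;   /* slow path: pick the nearer bound */
    }
    return x;                          /* fast path: x was already in range */
}
```

The hint changes code layout only, not results, so it is safe to drop if measurements show no benefit.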

Random thought: the code

if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax = r12 - 1; }  // If r12 can be zero, the else matters

is equivalent to

eax = min(eax, r12-1);
eax = max(eax, 0);

r12 can't be negative, but OP didn't say it couldn't be zero. This ordering preserves the if/else semantics. (edit: actually OP did say you can assume r12>0, not >=0.) If we had a fast min/max in asm, we could use it here. vector-max is a single-instruction, but scalar takes more code.
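The equivalence can be checked by brute force over a small range. This is a sketch under the assumption limit > 0, which the OP confirmed. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The if/else form from the question. Assumes limit > 0. */
static int64_t clamp_if_else(int64_t x, int64_t limit)
{
    assert(limit > 0);
    if (x < 0) {
        x = 0;
    } else if (x >= limit) {
        x = limit - 1;
    }
    return x;
}

/* The min-then-max form. Assumes limit > 0. */
static int64_t clamp_min_max(int64_t x, int64_t limit)
{
    assert(limit > 0);
    const int64_t hi = limit - 1;
    x = x < hi ? x : hi;   /* min(x, limit - 1) */
    x = x > 0 ? x : 0;     /* max(x, 0) */
    return x;
}

/* Returns 1 if both forms agree for every x in [-span, limit + span]. */
static int clamps_agree(int64_t limit, int64_t span)
{
    for (int64_t x = -span; x <= limit + span; x++) {
        if (clamp_if_else(x, limit) != clamp_min_max(x, limit)) {
            return 0;
        }
    }
    return 1;
}
```

A call such as clamps_agree(10, 20) compares the two forms on every input from -20 to 30.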
