x86 assembly - optimization of clamping rax to [ 0 .. limit )

Views: 503 · Published: 2016/7/18 21:40:11 · Tags: assembly optimization x86 nasm


Problem description

I'm writing a simple assembler procedure, which, naturally, aims to be as quick as possible. However, a certain part, which is located in the most nested loop, doesn't seem 'right' and I believe it is possible to come up with cleverer and quicker implementation, maybe even without using conditional jumps. The code implements a simple thing:

if rax < 0 then rax := 0
else if rax >= r12 then rax := r12 - 1

Here is my naive approach:

cmp rax, 0
jge offsetXGE
   mov rax, 0
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   mov rax, r12
   dec rax
offsetXReady:

Any ideas are welcome, even those using MMX and some masking tricks.

Edit: To answer some questions in comments - yes, we can assume that r12 > 0, but rax can be negative.

Recommended answer

One compare-and-branch, and an LEA + cmov.

It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD to clamp values to a range like this.
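If whole vectors of values do need clamping, the min/max form is what a vectorizing compiler can work with. The following is a minimal sketch in C, assuming 32-bit elements and limit > 0; `clamp_array` is a hypothetical name, and whether the loop actually becomes PMINSD/PMAXSD depends on the compiler and target flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp every element of v[0..n) to [0, limit). Assumes limit > 0. */
void clamp_array(int32_t *v, size_t n, int32_t limit)
{
    const int32_t hi = limit - 1;
    for (size_t i = 0; i < n; i++) {
        int32_t x = v[i];
        x = (x < hi) ? x : hi;  /* min(x, limit - 1): a candidate for PMINSD */
        x = (x > 0) ? x : 0;    /* max(x, 0): a candidate for PMAXSD */
        v[i] = x;
    }
}
```

The loop body has no data-dependent branches, so each iteration is independent and suitable for SIMD.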

In your original, a couple things are clearly sub-optimal. The first two only matter for code-size most of the time, but LEA for a non-destructive add is a small but clear win:

cmp rax, 0 should be test rax, rax.

mov rax, 0 should be xor eax, eax. No, eax is not a typo for rax.

mov rax, r12 / dec rax should be lea rax, [r12 - 1].

See the links in the x86 wiki, esp. Agner Fog's guides.

After searching around some, I found a similar question about optimal x86 asm for clamping to a range (http://codereview.stackexchange.com/questions/6502/fastest-way-to-clamp-an-integer-to-the-range-0-255). I got some inspiration from that, but mostly rewrote it with cmov instead of setcc/dec/and.

You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.

    ...
    cmp  rax, r12
    jae  .clamp      ; favour the fast-path more heavily by making it the not-taken case
.clamp_finished:     ; rdx is clobbered, since the clamp code uses a scratch reg

    ...
    ret

.clamp:   
    ; flags still set from the cmp rax, r12
    ; we only get here if rax is >= r12 (`ge` signed compare), or negative (`l` rax < r12, signed)

    ; mov r15d, 0        ; or zero it outside the loop so it can be used when needed.  Can't xor-zero because we need to preserve flags

    lea    rax, [r12-1]  ; still doesn't modify flags
    cmovl  eax, r15d     ; rax=0 if  orig_rax<r12 (signed), which means we got here because orig_rax<0
    jmp  .clamp_finished
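The logic of the asm above can be modelled in C. This is only a sketch, assuming limit > 0 as stated in the question; `clamp_one_compare` is a hypothetical name, and a compiler is free to generate different code from it.

```c
#include <stdint.h>

/* Clamp x to [0, limit). Assumes limit > 0, as stated in the question. */
int64_t clamp_one_compare(int64_t x, int64_t limit)
{
    /* Unsigned compare: a negative x becomes a huge unsigned value, so this
       single test catches both x < 0 and x >= limit (the jae in the asm). */
    if ((uint64_t)x >= (uint64_t)limit) {
        /* Slow path: a signed compare picks which bound was violated
           (the lea + cmovl in the asm). */
        x = (x < limit) ? 0 : limit - 1;
    }
    return x;
}
```

For example, clamp_one_compare(-1, 10) takes the slow path and yields 0, while clamp_one_compare(3, 10) falls straight through and returns 3.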

Quick perf analysis for Intel Haswell:

Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.

Clamping-needed case: One taken compare-and-branch uop, plus 4 more uops (lea, 2 for cmov, 1 more to jmp back.) Latency for rax: 3 cycles from the later of rax and r12 (cmp->flags, flags->cmov).

Obviously you can use jb instead of jae to skip over the clamping lea/cmov, instead of pulling them out of the main flow. See the section below for the motivation for that. (And/or see Anatolyg's excellent answer, which covers this. I also got the cool trick of using jb to handle [0 .. limit) with one branch from Anatolyg's answer.)

I think the version using cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency (except in the clamp-to-zero case with branches, see below).

An alternative branchy implementation of the .clamp code that doesn't need a zeroed register would be:

.clamp:
    lea    rax, [r12-1]
    jge  .clamp_finished
    xor    eax, eax
    jmp  .clamp_finished

It still computes a result that it might throw away, cmov-style. However, the following xor starts a fresh dependency chain, so it doesn't have to wait for lea to write rax.

An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast path through the code (as few instructions and as few taken branches as possible). Depending on how infrequently branches are taken, it can be worth putting the code for the uncommon case off at the end of the function.

func:
    ...
    test
    jcc .unlikely
    ...        
.ret_from_unlikely:
    ...
    ... ;; lots of code
    ret

.unlikely:
    xor   eax,eax
    jmp .ret_from_unlikely   ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.
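In C, the same intent can be hinted to GCC or Clang with `__builtin_expect`. A minimal sketch, assuming limit > 0; `UNLIKELY` and `clamp_hinted` are hypothetical names, and the compiler remains free to ignore the hint or lay the code out differently.

```c
#include <stdint.h>

/* __builtin_expect is a GCC/Clang extension; elsewhere the macro is a no-op. */
#if defined(__GNUC__)
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#define UNLIKELY(cond) (cond)
#endif

/* Clamp x to [0, limit), telling the compiler that clamping is the rare case. */
int64_t clamp_hinted(int64_t x, int64_t limit)
{
    if (UNLIKELY((uint64_t)x >= (uint64_t)limit)) {
        /* Cold path: the compiler may move this out of the main flow. */
        return (x < limit) ? 0 : limit - 1;
    }
    return x;  /* hot path: falls through */
}
```

The hint only affects code layout and static prediction assumptions; the result is the same with or without it.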

gcc does this, I think, when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.

Random thoughts: the code

if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax = r12 - 1; }  // If r12 can be zero, the else matters

is equivalent to

eax = min(eax, r12-1);
eax = max(eax, 0);

r12 can't be negative, but OP didn't say it couldn't be zero. This ordering preserves the if/else semantics. (edit: actually OP did say you can assume r12>0, not >=0.) If we had a fast min/max in asm, we could use it here. Vector max is a single instruction, but scalar takes more code.
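The two forms can be written side by side in C and compared. This is a sketch under the assumption limit > 0; `min_i64`, `max_i64`, `clamp_if_else` and `clamp_min_max` are hypothetical helper names.

```c
#include <stdint.h>

static int64_t min_i64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max_i64(int64_t a, int64_t b) { return a > b ? a : b; }

/* The if/else form from the question. */
int64_t clamp_if_else(int64_t x, int64_t limit)
{
    if (x < 0) {
        x = 0;
    } else if (x >= limit) {
        x = limit - 1;
    }
    return x;
}

/* The min-then-max form: apply the upper bound first, then the lower bound. */
int64_t clamp_min_max(int64_t x, int64_t limit)
{
    x = min_i64(x, limit - 1);
    return max_i64(x, 0);
}
```

For limit > 0 both functions return the same value for every x, which is what makes the min/max rewrite usable here.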
