使用循环展开计数正,负,零数的最有效的方法 [英] The most efficient way of counting positive, negative and zero number using loop unrolling

查看:386
本文介绍了使用循环展开计数正,负,零数的最有效的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我有下面的指令,只检查一个数字是正还是没有(负或零),如果是正面加1到我们的柜台(我们不关心数字为负或零)。我可以用一个简单的循环展开做到这一点:

 循环:
    mrmovq(%RDI),R10%#读VAL [0] SRC
    和Q%R10,%R10#VAL [0]&下; = 0?
    JLE Npos1#如果是这样,转到的非营利组织:
    iaddq $ 1,RAX%#伯爵++Npos1:
    mrmovq 8(%RDI),R11%#读VAL [1]从SRC + 8
    和Q%R11,%R11#VAL&下; = 0?
    JLE Npos2#如果是这样,转到下一个非营利组织:
    iaddq $ 1,RAX%Npos2:
    mrmovq 16(%RDI),R11%#读VAL [2] SRC + 8
    和Q%R11,%R11#VAL&下; = 0?
    JLE Npos3#如果是这样,转到下一个非营利组织:
    iaddq $ 1,RAX%

我的问题是我怎么能得到同样高效的结构,如果我还想来检查为零或负值。在这种情况下,我将有三个计数器(一个用于POS机,一个用于负,一个是零)
一种低效的方式会是这样。我试图让相同的结构previous例子(我们在%RAX 存储正数,在%RBX 和零点%RCX

 循环:mrmovq(%RDI),R10%来自#阅读的src ... VAL
        和Q%R10,%R10#VAL&下; = 0?
        JLE的非营利组织#如果是这样,转到的非营利组织:
        irmovq $ 1,R11%
        addq%R11,%#RAX在伯爵RAX阳性 - count_pos ++
        JMP休息
非营利组织:和Q%R10,R10%#不积极
        JE零
        irmovq $ 1,R11%
        addq%R11,%#RBX在伯爵底片RBX - count_neg ++
        JMP休息
零:irmovq $ 1,%R11
        addq%R11,%#RCX在伯爵RCX零 - count_zero ++
休息:irmovq $ 1,R10%
        SUBQ%R10,%#RDX len--
        irmovq $ 8%R10
        addq%R10,%RDI#SRC ++
        addq%R10,%#RSI DST ++
        和Q%的RDX,RDX%#&LEN GT; 0?
        JG回路#如果是这样,转到循环:


解决方案

更新:看到快结束的一个非分支版本,应该是的的更好,琐碎展开。但答案的其余部分仍然是值得一读,海事组织。

我没有找到一个方法来拯救每测试值执行,一个展开一对夫妇的说明,但比起我与使用循环尾复制一个精心优化的版本管理是pretty较小。 (见下文)。


Y86过于精简,以便有效code相比,在很多情况下真正的架构。一方面,似乎没有成为一个方式没有弄错国旗有条件递增。 (X86具有 LEA RAX,[RAX + 1] )。

我不明白的方式由只计算正,零,在循环后计算从负计数节省大量的指令。你还是要转移到测试每个值。 更新:不,你不这样做,因为你可以模拟86的 setcc 使用Y86的 CMOV


不过,我也发现在你的code几大改进:


  • 重用第一个测试设置的标志,而不是重新测试


  • 另一个主要的是要扯起%R11 = 1 圈外,所以你只需一个insn的增加。在寄存器中设置常量是即使在现实code一个非常常见的事情。大多数的ISA(包括RISC加载存储设备)有加,立即作出指示,像86的加$ 1%RAX ,但Y86不那么它甚至需要此技术的增量(86 INC%RAX )!


  • 设置标志,所以不是用它们做一个单独的测试。


作风问题:

使用描述性标签的名字,你并不需要尽可能多的评论。

此外,缩进操作数一致列,而不是变长助记符后只是一个单一的空间。这是更具可读性。我喜欢缩进分支目标少,使他们脱颖而出,但有这么多的分公司,在此code,它实际上只是看起来丑陋:/

  irmovq $ 1,%#R11吊出循环
        irmovq $ 8%R8循环:mrmovq(%RDI),R10%来自#阅读的src ... VAL
        距离Val和Q%R10,R10%#组标志
        JLE not_positive
        addq%R11,%#RAX在伯爵RAX阳性 - count_pos ++
        JMP休息
not_positive:
        距离Val JE零#标志仍
        addq%R11,%#RBX在伯爵底片RBX - count_neg ++
        JMP休息
零:
        addq%R11,%#RCX在伯爵RCX零 - count_zero ++
休息:
        addq%R8,%RDI#SRC + = 8
        // addq%R8,RSI%#DST ++ //为什么呢?未使用...还要注意的是SI表示源索引,因此preFER保持在RSI指针SRC,和dest的指针在RDI易读。
        SUBQ%R11,%#RDX len--,设置标志
        JG回路#},而(len--→1)。砸锅当RDX = 0


环尾复制:

您可以增加code大小,但减少,实际上通过复制环尾运行指令数。

我也重组了循环,这样,只有一个循环体的分支每次迭代。

  irmovq $ 1,%#R11吊出循环
        irmovq $ 8%R8循环:mrmovq(%RDI),R10%来自#阅读的src ... VAL
        addq%R8,%RDI#SRC = + 8为下一个迭代        距离Val和Q%R10,R10%#组标志
        JE零
        JL负
        #别的正面
        addq%R11,%#RAX在伯爵RAX阳性 - count_pos ++        SUBQ%R11,%RDX
        JG循环
        JMP end_loop
负:
        addq%R11,%#RBX在伯爵底片RBX - count_neg ++        SUBQ%R11,%RDX
        JG循环
        JMP end_loop
零:
        addq%R11,%#RCX在伯爵RCX零 - count_zero ++        SUBQ%R11,%#RDX len--,设置标志
        JG回路#},而(len--→1)。砸锅当RDX = 0
end_loop:

有没有很多从展开中受益,因为循环体是如此之大。如果你做到了,你可能会做这样的:

请注意,我们只更新和检查 LEN 每次迭代一次。这意味着我们需要一个清理循环,但只有递减和检查一次将主要打败展开的目的。

由两个展开后,与尾复制

  irmovq $ 1,%#R11吊出循环
        irmovq $ 2,%R12
        irmovq $ 16%R8        子%R12,%RDI
        JL end_loop#展开的循环需要LEN> = 2循环:mrmovq(%RDI),R10%来自#阅读的src ... VAL
        mrmovq 8(%RDI),R9%#读取下一VAL这里,所以我们不必重复此
        addq%R8,%RDI#SRC = + 16为下一次迭代        距离Val和Q%R10,R10%#组标志
        JE ZERO1
        JL Negative1
        #别的Positive1
        addq%R11,%#RAX在伯爵RAX阳性 - count_pos ++        从val2中和Q%R9,R9%#组标志
        JE ZERO2
        JL Negative2
Positive2:
        addq%R11,%#RAX在伯爵RAX阳性 - count_pos ++        SUBQ%R12,%#RDX尾部循环
        JGE循环
        JMP end_loopNegative1:
        addq%R11,%#RBX在伯爵底片RBX - count_neg ++        从val2中和Q%R9,R9%#组标志
        JE ZERO2
        JG Positive2
Negative2:
        addq%R11,%#RBX在伯爵底片RBX - count_neg ++        SUBQ%R12,%#RDX尾部循环
        JGE循环
        JMP end_loopZERO1:
        addq%R11,%#RCX在伯爵RCX零 - count_zero ++        从val2中和Q%R9,R9%#组标志
        JG Positive2
        JL Negative2
ZERO2:
        addq%R11,%#RCX在伯爵RCX零 - count_zero ++        SUBQ%R12,%#RDX = len- 2,设置标志
        JGE回路#告吹时RDX = -1或-2
end_loop:#循环尾声处理那里有奇数个元素的情况下,让RDX = -1的位置:
        添加%R12,%RDX
        JNE all_done
        #else伪还有一做
        #...负载和测试一个元素

如果有一个差一错误在我的循环条件或东西,我也不会感到惊讶。


小丑一样在评论中指出,X86可以指望用底片

 特区$ 63%R10#广播符号位的所有位:-1或0
子%R10,%RBX#减法-1(或0):即,加1(或0)


更新:无分支的版本,使用CMOV效仿setcc

我们可以使用 CMOV 到寄存器设置为0或1,然后添加到我们的计数。这样就避免了所有的分支。 (0是该添加剂的身份。这个基本技术适用于具有单位元的任何操作。例如全1是用于标识元素与1是用于乘法的单位元。)

展开,这是简单的,但也有环路开销的3指令相比,需要重复8条指令。该收益将是相当小的。

  irmovq $ 1,%#R11吊出循环
        irmovq $ 8%R8
        MOV%的RDX,%#RBX是neg_count计算后循环:mrmovq(%RDI),R10%来自#阅读的src ... VAL
        addq%R8,%RDI#SRC = + 16为下一次迭代        距离Val和Q%R10,R10%#组标志        irmovq $ 0%R13
        cmovg%R11,R13%#效仿setcc
        irmovq $ 0%R14
        cmove%R11,R14%        添加%R13,%#RAX + pos_count =(VAL大于0)
        添加%R14,%#RCX + zero_count =(VAL == 0)        SUBQ%R11,%#RDX = len- 1,设置标志
        JG回路#告吹时RDX = 0
end_loop:        子%RAX,RBX%
        子%RCX,%#RBX = neg_count initial_len - pos_count - zero_count

如果分支机构(特别是未predictable分支机构)是昂贵的,这个版本会做的更好。使用小丑的建议计算从其他两个的计数中的一个的帮在这种情况下,大量的

有pretty良好的指令级并行在这里。两个独立setcc - >添加依存关系链可以并行运行一次的测试结果已经准备好

Say I have the following instruction, simply checks if a number is positive or not (negative or zero) and if it was positive add 1 to our counter (and we don't care if the numbers is negative or zero). I can do this with a simple loop unrolling:

Loop:   
    mrmovq (%rdi), %r10     # read val[0] from src
    andq %r10, %r10         # val[0] <= 0?
    jle Npos1               # if so, goto Npos:
    iaddq $1, %rax          # count++

Npos1:      
    mrmovq 8(%rdi), %r11    # read val[1] from src+8
    andq %r11,%r11          # val <= 0?
    jle Npos2               # if so, goto next npos:
    iaddq $1, %rax

Npos2:      
    mrmovq 16(%rdi), %r11   # read val[2] from src+8
    andq %r11,%r11          # val <= 0?
    jle Npos3               # if so, goto next npos:
    iaddq $1, %rax

My question is how I can get the same efficient structure if I want to check also for being zero or negative. In this case I'll have three counters (one for pos, one for neg and one for zero) One inefficient way would be like this. I am trying to make the same structure as the previous example (we are storing positive counts in %rax, negatives in %rbx and zeros in %rcx) :

Loop:   mrmovq (%rdi), %r10 # read val from src...
        andq %r10, %r10     # val <= 0?
        jle Npos            # if so, goto Npos:
        irmovq $1, %r11
        addq %r11, %rax     # Count positives in rax - count_pos++ 
        jmp Rest 
Npos:   andq %r10, %r10     # Not positive 
        je Zero
        irmovq $1, %r11
        addq %r11, %rbx     # Count negatives in rbx - count_neg++
        jmp Rest
Zero:   irmovq $1, %r11
        addq %r11, %rcx     # Count zeroes in rcx - count_zero++
Rest:   irmovq $1, %r10
        subq %r10, %rdx     # len--
        irmovq $8, %r10
        addq %r10, %rdi     # src++
        addq %r10, %rsi     # dst++
        andq %rdx,%rdx      # len > 0?
        jg Loop             # if so, goto Loop:

解决方案

update: see the very end for a non-branching version that should be much better, and trivial to unroll. But the rest of the answer is still worth reading, IMO.

I did find a way to save a couple instructions executed per value tested, with an unroll, but it's pretty minor compared to what I managed with a well-optimized version that used loop tail duplication. (see below).


y86 is too stripped-down to allow efficient code compared to real architectures in a lot of cases. For one thing, there doesn't seem to be a way to conditionally increment without clobbering flags. (x86 has lea rax, [rax+1]).

I don't see a way to save a lot of instructions by only counting positive and zero, and calculating the negative count from that after the loop. You still have to branch to test each value. update: no you don't, because you can emulate x86's setcc using y86's cmov!


However, I did find several big improvements in your code:

  • reuse the flags set by the first test, instead of re-testing

  • Another major thing is to hoist the %r11 = 1 out of the loop, so you can just increment with one insn. Setting up constants in registers is a really common thing even in real code. Most ISAs (including RISC load-store machines) have add-immediate instructions, like x86's add $1, %rax, but y86 doesn't so it needs this technique even for increments (x86 inc %rax)!

  • sub sets flags, so use them instead of doing a separate test.

Style issues:

With descriptive label names, you don't need as many comments.

Also, indent your operands to a consistent column, rather than just a single space after the variable-length mnemonic. It's more readable. I like indenting branch targets less, to make them stand out, but there are so many branches in this code that it actually just looks ugly :/

        irmovq  $1, %r11     # hoisted out of the loop
        irmovq  $8, %r8

Loop:   mrmovq  (%rdi), %r10 # read val from src...
        andq    %r10, %r10   # set flags from val
        jle    not_positive
        addq    %r11, %rax   # Count positives in rax - count_pos++ 
        jmp    Rest 
not_positive:
        je     Zero          # flags still from val
        addq    %r11, %rbx   # Count negatives in rbx - count_neg++
        jmp    Rest
Zero:
        addq    %r11, %rcx   # Count zeroes in rcx - count_zero++
Rest:
        addq    %r8, %rdi    # src+=8
        //addq %r8, %rsi     # dst++  // why?  Not used...  Also note that si stands for source index, so prefer keeping src pointers in rsi, and dest pointers in rdi for human readability.
        subq    %r11, %rdx   # len--, setting flags
        jg     Loop          # } while( len-- > 1).  fall through when rdx=0


loop tail duplication:

You can increase code size but decrease the number of instructions that actually run by duplicating the loop tail.

I also restructured the loop so there's only one taken branch per iteration in the loop body.

        irmovq $1, %r11       # hoisted out of the loop
        irmovq $8, %r8

Loop:   mrmovq (%rdi), %r10   # read val from src...
        addq   %r8, %rdi      # src+=8 for next iteration

        andq   %r10, %r10     # set flags from val
        je    Zero
        jl    Negative
        # else Positive
        addq   %r11, %rax     # Count positives in rax - count_pos++ 

        subq   %r11, %rdx
        jg    Loop
        jmp   end_loop
Negative:
        addq   %r11, %rbx     # Count negatives in rbx - count_neg++

        subq   %r11, %rdx
        jg    Loop 
        jmp   end_loop
Zero:
        addq   %r11, %rcx     # Count zeroes in rcx - count_zero++

        subq   %r11, %rdx     # len--, setting flags
        jg Loop               # } while( len-- > 1).  fall through when rdx=0
end_loop:

There's not a lot to gain from unrolling, since the loop body is so big. If you did, you might do it like this:

Note that we only update and check len once per iteration. This means we need a cleanup loop, but only decrementing and checking one at a time would mostly defeat the purpose of unrolling.

Unrolled by two, with tail duplication

        irmovq $1, %r11       # hoisted out of the loop
        irmovq $2, %r12
        irmovq $16, %r8

        sub    %r12, %rdi
        jl     end_loop       # unrolled loop requires len >= 2

Loop:   mrmovq (%rdi), %r10   # read val from src...
        mrmovq 8(%rdi), %r9   # read next val here so we don't have to duplicate this
        addq   %r8, %rdi      # src+=16 for next iteration

        andq   %r10, %r10     # set flags from val
        je    Zero1
        jl    Negative1
        # else Positive1
        addq   %r11, %rax     # Count positives in rax - count_pos++ 

        andq   %r9, %r9       # set flags from val2
        je    Zero2
        jl    Negative2
Positive2:
        addq   %r11, %rax     # Count positives in rax - count_pos++ 

        subq   %r12, %rdx     # loop tail
        jge   Loop
        jmp   end_loop

Negative1:
        addq   %r11, %rbx     # Count negatives in rbx - count_neg++

        andq   %r9, %r9       # set flags from val2
        je    Zero2
        jg    Positive2
Negative2:
        addq   %r11, %rbx     # Count negatives in rbx - count_neg++

        subq   %r12, %rdx     # loop tail
        jge   Loop 
        jmp   end_loop

Zero1:
        addq   %r11, %rcx     # Count zeroes in rcx - count_zero++

        andq   %r9, %r9       # set flags from val2
        jg    Positive2
        jl    Negative2
Zero2:
        addq   %r11, %rcx     # Count zeroes in rcx - count_zero++

        subq   %r12, %rdx     # len-=2, setting flags
        jge   Loop            # fall through when rdx=-1 or -2
end_loop:

# loop epilogue to handle cases where there was an odd number of elements, so rdx=-1 here:
        add   %r12, %rdx
        jne  all_done
        #else there's one more to do
        #... load and test a single element

I wouldn't be surprised if there's an off-by-one error in my loop conditions or something.


Like Jester pointed out in comments, x86 can count negatives with

sar   $63, %r10     # broadcast the sign bit to all bits: -1 or 0
sub   %r10, %rbx    # subtracting -1 (or 0): i.e. add 1 (or 0)


Update: non-branching version that uses cmov to emulate setcc

We can use cmov to set a register to 0 or 1, and then add that to our count. This avoids all branching. (0 is the additive identity. This basic technique works for any operation that has an identity element. e.g. all-ones is the identity element for AND. 1 is the identity element for multiply.)

Unrolling this is straightforward, but there are 3 instructions of loop overhead compared to 8 instructions that need to be repeated. The gains would be fairly small.

        irmovq $1, %r11       # hoisted out of the loop
        irmovq $8, %r8
        mov    %rdx, %rbx     # neg_count is calculated later

Loop:   mrmovq (%rdi), %r10   # read val from src...
        addq   %r8, %rdi      # src+=16 for next iteration

        andq   %r10, %r10     # set flags from val

        irmovq $0, %r13
        cmovg  %r11, %r13     # emulate setcc
        irmovq $0, %r14
        cmove  %r11, %r14

        add    %r13, %rax     # pos_count  += (val >  0)
        add    %r14, %rcx     # zero_count += (val == 0)

        subq   %r11, %rdx     # len-=1, setting flags
        jg    Loop            # fall through when rdx=0
end_loop:

        sub    %rax, %rbx
        sub    %rcx, %rbx     # neg_count = initial_len - pos_count - zero_count

If branches (esp. unpredictable branches) are expensive, this version will do much better. Using Jester's suggestion of calculating one of the counts from the other two helped a lot in this case.

There's pretty good instruction-level parallelism here. The two separate setcc -> add dependency chains can run in parallel once the test result is ready.

这篇关于使用循环展开计数正,负,零数的最有效的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆