更快的16位乘法算法的8位MCU [英] Faster 16bit multiplication algorithm for 8-bit MCU

查看:294
本文介绍了更快的16位乘法算法的8位MCU的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在寻找一种算法来将两个整数比低于一筹。你是否想过这是个好主意? (该MCU - 在微小的84/85或类似的 - 在这个code运行时没有MUL / DIV操作员)

I'm searching for an algorithm to multiply two integer numbers that is better than the one below. Do you have a good idea about that? (The MCU - AT Tiny 84/85 or similar - where this code runs has no mul/div operator)

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res=0;

    while (b) {
        if ( (b & 1) )
            res+=a;
        b>>=1;
        a+=a;
    }

    return res;
}

这个算法,在使用AVR-gcc编译器编译AT微型八十四分之八十五,几乎是相​​同的__mulhi3的AVR-GCC生成算法。

This algorithm, when compiled for AT Tiny 85/84 using the avr-gcc compiler, is almost identical to the algorithm __mulhi3 the avr-gcc generates.

AVR-GCC算法:

avr-gcc algorithm:

00000106 <__mulhi3>:
 106:   00 24           eor r0, r0
 108:   55 27           eor r21, r21
 10a:   04 c0           rjmp    .+8         ; 0x114 <__mulhi3+0xe>
 10c:   08 0e           add r0, r24
 10e:   59 1f           adc r21, r25
 110:   88 0f           add r24, r24
 112:   99 1f           adc r25, r25
 114:   00 97           sbiw    r24, 0x00   ; 0
 116:   29 f0           breq    .+10        ; 0x122 <__mulhi3+0x1c>
 118:   76 95           lsr r23
 11a:   67 95           ror r22
 11c:   b8 f3           brcs    .-18        ; 0x10c <__mulhi3+0x6>
 11e:   71 05           cpc r23, r1
 120:   b9 f7           brne    .-18        ; 0x110 <__mulhi3+0xa>
 122:   80 2d           mov r24, r0
 124:   95 2f           mov r25, r21
 126:   08 95           ret

umul16_算法:

umul16_ algorithm:

00000044 <umul16_>:
  44:   20 e0           ldi r18, 0x00   ; 0
  46:   30 e0           ldi r19, 0x00   ; 0
  48:   61 15           cp  r22, r1
  4a:   71 05           cpc r23, r1
  4c:   49 f0           breq    .+18        ; 0x60 <umul16_+0x1c>
  4e:   60 ff           sbrs    r22, 0
  50:   02 c0           rjmp    .+4         ; 0x56 <umul16_+0x12>
  52:   28 0f           add r18, r24
  54:   39 1f           adc r19, r25
  56:   76 95           lsr r23
  58:   67 95           ror r22
  5a:   88 0f           add r24, r24
  5c:   99 1f           adc r25, r25
  5e:   f4 cf           rjmp    .-24        ; 0x48 <umul16_+0x4>
  60:   c9 01           movw    r24, r18
  62:   08 95           ret


编辑:的指令集可这里


The instruction set is available here.

推荐答案

摘要

  1. 在考虑换 A B (原提案)
  2. 在试图避免的条件跳转(不成功优化)
  3. 输入公式的重塑(估计35%的涨幅)
  4. 在删除重复的移
  5. 展开循环:将最佳组装
  6. 说服编译器给予最佳的组件
  1. Consider swapping a and b (Original proposal)
  2. Trying to avoid conditional jumps (Not successful optimization)
  3. Reshaping of the input formula (estimated 35% gain)
  4. Removing duplicated shift
  5. Unrolling the loop: The "optimal" assembly
  6. Convincing the compiler to give the optimal assembly


1。考虑交换 A B


1. Consider swapping a and b

一个改进是将先比较A和B,以及交换他们,如果 A&LT; b :您应该使用b 两者的较小的,这样就具有周期的最小数目。需要注意的是,你能避免通过交换的复制code 的(如果(A&LT; B)然后跳转到一个镜像code款),但我怀疑它的价值。

One improvement would be to first compare a and b, and swap them if a<b: you should use as b the smaller of the two, so that you have the minimum number of cycles. Note that you can avoid swapping by duplicating the code (if (a<b) then jump to a mirrored code section), but I doubt it's worth.


2。试图避免条件跳转(不成功优化)

尝试:

uint16_t umul16_(uint16_t a, uint16_t b)
{
    ///Here swap if necessary
    uint16_t accum=0;

    while (b) {
        accum += ((b&1) * uint16_t(0xffff)) & a; //Hopefully this multiplication is optimized away
        b>>=1;
        a+=a;
    }

    return accum;
}

这是塞尔吉奥的反馈,这并没有带来改善。

From Sergio's feedback, this didn't bring improvements.


3。输入公式的重塑

考虑到目标架构已基本只有8位的指令,如果你单独输入变量的上,下8位,你可以写:

Considering that the target architecture has basically only 8bit instructions, if you separate the upper and bottom 8 bit of the input variables, you can write:

a = a1 * 0xff + a0;
b = b1 * 0xff + b0;

a * b = a1 * b1 * 0xffff + a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0

现在,最酷的是,我们可以扔掉这个词 A1 * B1 * 0xFFFF的,因为 0xFFFF的发送出去的寄存器的。

Now, the cool thing is that we can throw away the term a1 * b1 * 0xffff, because the 0xffff send it out of your register.

(16bit) a * b = a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0

此外, A0 * B1 A1 * B0 词是可以治疗的,因为作为8位乘法, 0xFF的:任何部分超过256会的送出寄存器的

Furthermore, the a0*b1 and a1*b0 term can be treated as 8bit multiplications, because of the 0xff: any part exceeding 256 will be sent out of the register.

到目前为止精彩! ......但是,这里来的现实惊人的: A0 * B0 已被视为一个16位乘法,因为你必须把所有产生的位。 A0 将必须保持在16位,使移位的左派。这个乘法有一半 A * B 的迭代,它的部分的8位(因为B0),但你还是要考虑到前面提到的2 8位乘法,和最终的结果组合物。 我们还需要进一步的整形!

So far exciting! ... But, here comes reality striking: a0 * b0 has to be treated as a 16 bit multiplication, as you'll have to keep all resulting bits. a0 will have to be kept on 16 bit to allow shift lefts. This multiplication has half of the iterations of a * b, it is in part 8bit (because of b0) but you still have to take into account the 2 8bit multiplications mentioned before, and the final result composition. We need further reshaping!

所以现在我收集 B0

(16bit) a * b = a0 * b1 * 0xff + b0 * (a0 + a1 * 0xff)

(a0 + a1 * 0xff) = a

所以我们得到:

So we get:

(16bit) a * b = a0 * b1 * 0xff + b0 * a

如果N是原始的 A * B 的周期,现在第一项是一个8位乘以N / 2周期,第二个是16位* 8位乘法用N / 2个周期。考虑的M每次迭代的指令在原 A * B 数,8位* 8bit的迭代有一半的指令和16位* 8位约80%的M(一个班次指令少B0相比,B)。放在一起我们 N / 2 * M / 2 + N / 2 * M * 0.8 = N * M * 0.65 的复杂性,所以预期的节省〜35 %相对于原来的 N * M 。声音有为

If N were the cycles of the original a * b, now the first term is an 8bit multiplication with N/2 cycles, and the second a 16bit * 8bit multiplication with N/2 cycles. Considering M the number of instructions per iteration in the original a*b, the 8bit*8bit iteration has half of the instructions, and the 16bit*8bit about 80% of M (one shift instruction less for b0 compared to b). Putting together we have N/2*M/2+N/2*M*0.8 = N*M*0.65 complexity, so an expected saving of ~35% with respect to the original N*M. Sounds promising.

这是在code:

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    while (b1) {///Maximum 8 cycles
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
        a0+=a0;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    while (b0) {///Maximum 8 cycles
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
        a+=a;
    }

    return res;
}

另外,在2个循环分裂应该加倍,在理论上,跳越某些周期的机会:N / 2个可能是轻微的高估

Also, the splitting in 2 cycles should double, in theory, the chance of skipping some cycles: N/2 might be a slight overestimate.

一个微小的进一步改善包括避免过去的,不必要的换挡为 A 变量。小边注:如果任何B0或B1是零它会导致2个额外的指令。 但是的同时也节省了B0和B1的第一次检查,这是最昂贵的,因为它不能用于移位操作检查零标志状态for循环的条件跳转。

A tiny further improvement consist in avoiding the last, unnecessary shift for the a variables. Small side note: if either b0 or b1 are zero it causes 2 extra instructions. But it also saves the first check of b0 and b1, which is the most expensive because it cannot check the zero flag status from the shift operation for the conditional jump of the for loop.

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    if ( (b1 & 1) )
        res1+=a0;
    b1>>=1;
    while (b1) {///Maximum 7 cycles
        a0+=a0;
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    if ( (b0 & 1) )
        res+=a;
    b0>>=1;
    while (b0) {///Maximum 7 cycles
        a+=a;
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
    }

    return res;
}


4。删除重复的移


4. Removing duplicated shift

是否还有改进的空间? ,如 A0 字节被转移的两次的。所以应该在组合所述两个环是有利的。这可能是一个有点棘手说服编译器做的正是我们想要的,尤其是与结果寄存器。

Is there still space for improvement? Yes, as the bytes in a0 gets shifted two times. So there should be a benefit in combining the two loops. It might be a little bit tricky to convince the compiler to do exactly what we want, especially with the result register.

所以,我们在处理同一个周期 B0 B1 。处理的第一件事是,这是循环退出条件?到目前为止,使用 B0 / B1 清除状态提供了方便,因为它避免了使用计数器。此外,右移后,标志可能已经设置,如果操作的结果是零,而该标志可以允许不经进一步的评估条件跳转

So, we process in the same cycle b0 and b1. The first thing to handle is, which is the loop exit condition? So far using b0/b1 cleared status has been convenient because it avoids using a counter. Furthermore, after the shift right, a flag might be already set if the operation result is zero, and this flag might allow a conditional jump without further evaluations.

现在的循环退出条件可以是的失败(B0 || B1)。然而,这可能需要昂贵的计算。一个解决方案是比较B0和B1,并跳转到2个不同的code节:如果 B1&GT; B0 我测试的 B1 的条件,否则我就试 B0 的条件。我preFER另一种解决方案,具有2个循环,第一个出口时 B0 为零,第二次当 B1 是零。会有这样的情况中,我会做零迭代的 B1 。问题的关键是,在第二圈我知道 B0 为零,这样我就可以减少操作的次数进行。

Now the loop exit condition could be the failure of (b0 || b1). However this could require expensive computation. One solution is to compare b0 and b1 and jump to 2 different code sections: if b1 > b0 I test the condition on b1, else I test the condition on b0. I prefer another solution, with 2 loops, the first exit when b0 is zero, the second when b1 is zero. There will be cases in which I will do zero iterations in b1. The point is that in the second loop I know b0 is zero, so I can reduce the number of operations performed.

现在,让我们忘了退出条件,并尝试加入2环路previous部分。

Now, let's forget about the exit condition and try to join the 2 loops of the previous section.

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res = 0;

    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Swapping probably doesn't make much sense anymore
    if ( (b1 & 1) )
        res+=(uint16_t)((uint8_t)(a && 0xff))*256;
    //Hopefully the compiler understands it has simply to add the low 8bit register of a to the high 8bit register of res

    if ( (b0 & 1) )
        res+=a;

    b1>>=1;
    b0>>=1;
    while (b0) {///N cycles, maximum 7
        a+=a;
        if ( (b1 & 1) )
            res+=(uint16_t)((uint8_t)(a & 0xff))*256;
        if ( (b0 & 1) )
            res+=a;
        b1>>=1;
        b0>>=1; //I try to put as last the one that will leave the carry flag in the desired state
    }

    uint8_t a0 = a & 0xff; //Again, not a real copy but a register selection

    while (b1) {///P cycles, maximum 7 - N cycles
        a0+=a0;
        if ( (b1 & 1) )
            res+=(uint16_t) a0 * 256;
        b1>>=1;
    }
    return res;
}

感谢塞尔吉奥提供生成的组件(-Ofast)。乍看之下,考虑到 MOV 的离谱金额在code,似乎编译器没有跨$ P $磅,因为我想在暗示我给了他跨preT的寄存器。

Thanks Sergio for providing the assembly generated (-Ofast). At first glance, considering the outrageous amount of mov in the code, it seems the compiler did not interpret as I wanted the hints I gave to him to interpret the registers.

输入是:R22,R23和r24,25
。 AVR指令集:快速参考,的详细文档

Inputs are: r22,r23 and r24,25.
AVR Instruction Set: Quick reference, Detailed documentation

sbrs //Tests a single bit in a register and skips the next instruction if the bit is set. Skip takes 2 clocks. 
ldi // Load immediate, 1 clock
sbiw // Subtracts immediate to *word*, 2 clocks

    00000010 <umul16_Antonio5>:
      10:    70 ff           sbrs    r23, 0
      12:    39 c0           rjmp    .+114        ; 0x86 <__SREG__+0x47>
      14:    41 e0           ldi    r20, 0x01    ; 1
      16:    00 97           sbiw    r24, 0x00    ; 0
      18:    c9 f1           breq    .+114        ; 0x8c <__SREG__+0x4d>
      1a:    34 2f           mov    r19, r20
      1c:    20 e0           ldi    r18, 0x00    ; 0
      1e:    60 ff           sbrs    r22, 0
      20:    07 c0           rjmp    .+14         ; 0x30 <umul16_Antonio5+0x20>
      22:    28 0f           add    r18, r24
      24:    39 1f           adc    r19, r25
      26:    04 c0           rjmp    .+8          ; 0x30 <umul16_Antonio5+0x20>
      28:    e4 2f           mov    r30, r20
      2a:    45 2f           mov    r20, r21
      2c:    2e 2f           mov    r18, r30
      2e:    34 2f           mov    r19, r20
      30:    76 95           lsr    r23
      32:    66 95           lsr    r22
      34:    b9 f0           breq    .+46         ; 0x64 <__SREG__+0x25>
      36:    88 0f           add    r24, r24
      38:    99 1f           adc    r25, r25
      3a:    58 2f           mov    r21, r24
      3c:    44 27           eor    r20, r20
      3e:    42 0f           add    r20, r18
      40:    53 1f           adc    r21, r19
      42:    70 ff           sbrs    r23, 0
      44:    02 c0           rjmp    .+4          ; 0x4a <__SREG__+0xb>
      46:    24 2f           mov    r18, r20
      48:    35 2f           mov    r19, r21
      4a:    42 2f           mov    r20, r18
      4c:    53 2f           mov    r21, r19
      4e:    48 0f           add    r20, r24
      50:    59 1f           adc    r21, r25
      52:    60 fd           sbrc    r22, 0
      54:    e9 cf           rjmp    .-46         ; 0x28 <umul16_Antonio5+0x18>
      56:    e2 2f           mov    r30, r18
      58:    43 2f           mov    r20, r19
      5a:    e8 cf           rjmp    .-48         ; 0x2c <umul16_Antonio5+0x1c>
      5c:    95 2f           mov    r25, r21
      5e:    24 2f           mov    r18, r20
      60:    39 2f           mov    r19, r25
      62:    76 95           lsr    r23
      64:    77 23           and    r23, r23
      66:    61 f0           breq    .+24         ; 0x80 <__SREG__+0x41>
      68:    88 0f           add    r24, r24
      6a:    48 2f           mov    r20, r24
      6c:    50 e0           ldi    r21, 0x00    ; 0
      6e:    54 2f           mov    r21, r20
      70:    44 27           eor    r20, r20
      72:    42 0f           add    r20, r18
      74:    53 1f           adc    r21, r19
      76:    70 fd           sbrc    r23, 0
      78:    f1 cf           rjmp    .-30         ; 0x5c <__SREG__+0x1d>
      7a:    42 2f           mov    r20, r18
      7c:    93 2f           mov    r25, r19
      7e:    ef cf           rjmp    .-34         ; 0x5e <__SREG__+0x1f>
      80:    82 2f           mov    r24, r18
      82:    93 2f           mov    r25, r19
      84:    08 95           ret
      86:    20 e0           ldi    r18, 0x00    ; 0
      88:    30 e0           ldi    r19, 0x00    ; 0
      8a:    c9 cf           rjmp    .-110        ; 0x1e <umul16_Antonio5+0xe>
      8c:    40 e0           ldi    r20, 0x00    ; 0
      8e:    c5 cf           rjmp    .-118        ; 0x1a <umul16_Antonio5+0xa>


5。展开循环:将最佳组装


5. Unrolling the loop: The "optimal" assembly

通过所有这些信息,让我们试着去了解什么是给定结构约束的最优的解决方案。 最佳是引用的原因是什么最佳输入的数据,我们希望优化的内容取决于很多。 假设我们要优化的循环次数在最坏的情况下。如果我们去的最坏的情况下,循环展开是一个合理的选择:我们知道我们有8个周期,我们删除所有测试,以了解是否我们完成了(如果B0和B1是零)。到目前为止,我们所使用的伎俩我们转移,我们检查零标志来检查,如果我们有退出循环。删除这一要求,我们可以使用不同的技巧:我们转移,和我们检查进位(在我们换挡时发出的寄存器位)的明白我是否应该更新结果。由于指令集,在组装叙事code指令成为继。

With all this information, let's try to understand what would be the "optimal" solution given the architecture constraints. "Optimal" is quoted because what is "optimal" depends a lot on the input data and what we want to optimize. Let's assume we want to optimize on number of cycles on the worst case. If we go for the worst case, loop unrolling is a reasonable choice: we know we have 8 cycles, and we remove all tests to understand if we finished (if b0 and b1 are zero). So far we used the trick "we shift, and we check the zero flag" to check if we had to exit a loop. Removed this requirement, we can use a different trick: we shift, and we check the carry bit (the bit we sent out of the register when shifting) to understand if I should update the result. Given the instruction set, in assembly "narrative" code the instructions become the following.

//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0

Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)

Main block:
0 Shift right b0 (LSR)
1 If carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Shift right b1 (LSR)
5 If carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)  
8 a1 = a1 + a1 + carry from prev. (ADC)

[Repeat same instructions for another 7 times]

分支是采取一条指令,如果没有跳造成的,另有2。其他所有指令都是1个周期。所以B1状态对循环数目没有影响,而我们有9个周期,如果B0 = 1,和8个周期,如果B0 = 0计数的初始化,8次迭代以及跳过的A0和A1的最后更新,在最坏的情况下(B0 = 11111111b),我们一共的有8 * 9 + 2 - 2 = 72次。我不知道哪个C ++实现将说服编译器生成它。也许:

Branching takes 1 instruction if no jump is caused, 2 otherwise. All other instructions are 1 cycle. So b1 state has no influence on the number of cycles, while we have 9 cycles if b0 = 1, and 8 cycles if b0 = 0. Counting the initialization, 8 iterations and skipping the last update of a0 and a1, in the worse case (b0 = 11111111b), we have a total of 8 * 9 + 2 - 2 = 72 cycles. I wouldn't know which C++ implementation would convince the compiler to generate it. Maybe:

 void iterate(uint8_t& b0,uint8_t& b1,uint16_t& a, uint16_t& r) {
     const uint8_t temp0 = b0;
     b0 >>=1;
     if (temp0 & 0x01) {//Will this convince him to use the carry flag?
         r += a;
     }
     const uint8_t temp1 = b1;
     b1 >>=1;
     if (temp1 & 0x01) {
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     }
     a += a;
 }

 uint16_t umul16_(uint16_t a, uint16_t b) {
     uint16_t r = 0;
     uint8_t b0 = b & 0xff;
     uint8_t b1 = b >>8;

     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r); //Hopefully he understands he doesn't need the last update for variable a
     return r;
 }

不过,考虑到previous因此,要真正获得所需的code应该真正切换到组件!

But, given the previous result, to really obtain the desired code one should really switch to assembly!

最后,我们也可以考虑循环展开一个更极端的跨pretation:在SBRC / SBRS指令让来测试寄存器的特定位。因此我们能够避免变速B0和B1,并且在每个周期检查一个不同比特。唯一的问题是,这些指令只允许跳过下一条指令,而不是一个自定义的跳跃。所以,在叙事code它看起来是这样的:

Finally one could also consider a more extreme interpretation of the loop unrolling: the sbrc/sbrs instructions allows to test on a specific bit of a register. We can therefore avoid shifting b0 and b1, and at each cycle check a different bit. The only problem is that those instructions only allow to skip the next instruction, and not for a custom jump. So, in "narrative code" it will look like this:

Main block:
0 Test Nth bit of b0 (SBRS). If set jump to 2 (+ 1cycle) otherwise continue with 1
1 Jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Test Nth bit of (SBRC). If cleared jump to 6 (+ 1cycle) otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)  
7 a1 = a1 + a1 + carry from prev. (ADC)

而第二代可以节省1个周期,还有第二代没有明显的优势。不过,我相信C ++ code可能会更容易跨preT编译器。考虑到8个周期,初始化和跳跃的A0和A1上一次更新,我们现在有 64个周期

C ++ code:

C++ code:

 template<uint8_t mask>
 void iterateWithMask(const uint8_t& b0,const uint8_t& b1, uint16_t& a, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     a += a;
 }

 uint16_t umul16_(uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;

     iterateWithMask<0x01>(b0,b1,a,r);
     iterateWithMask<0x02>(b0,b1,a,r);
     iterateWithMask<0x04>(b0,b1,a,r);
     iterateWithMask<0x08>(b0,b1,a,r);
     iterateWithMask<0x10>(b0,b1,a,r);
     iterateWithMask<0x20>(b0,b1,a,r);
     iterateWithMask<0x40>(b0,b1,a,r);
     iterateWithMask<0x80>(b0,b1,a,r);

     //Hopefully he understands he doesn't need the last update for a
     return r;
 }

请注意,在这个实施为0x01,0X02是不是一个真正的价值,而只是一个暗示,编译器知道要测试的位。因此,面具不能被右移获得:从迄今为止看到所有其他功能不同,这也着实没有相应的循环版本

Note that in this implementation the 0x01, 0x02 are not a real value, but just a hint to the compiler to know which bit to test. Therefore, the mask cannot be obtained by shifting right: differently from all other functions seen so far, this has really no equivalent loop version.

一个大问题是,

r+=(uint16_t)((uint8_t)(a & 0xff))*256;

这应该是研究的上寄存器中只是一笔带 A 下寄存器。 没有得到除preTED如我所愿。其他选项:

It should be just a sum of the upper register of r with the lower register of a. Does not get interpreted as I would like. Other option:

r+=(uint16_t) 256 *((uint8_t)(a & 0xff));


6。说服编译器给出最优组装


6. Convincing the compiler to give the optimal assembly

我们也可以保留 A 常数,而是转移的结果研究。在这种情况下,我们处理 B 从最显著位开始。复杂性是等价的,但它可能是编译器更容易消化。另外,这个时候我们要小心明确写入了最后的循环,一定不能做进一步右移了研究

We can also keep a constant, and shift instead the result r. In this case we process b starting from the most significant bit. The complexity is equivalent, but it might be easier for the compiler to digest. Also, this time we have to be careful to write explicitly the last loop, which must not do a further shift right for r.

 template<uint8_t mask>
 void inverseIterateWithMask(const uint8_t& b0,const uint8_t& b1,const uint16_t& a, const uint8_t& a0, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)256*a0; //Hopefully easier to understand for the compiler?
     r += r;
 }

 uint16_t umul16_(const uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;
     const uint8_t a0 = a & 0xff;

     inverseIterateWithMask<0x80>(b0,b1,a,r);
     inverseIterateWithMask<0x40>(b0,b1,a,r);
     inverseIterateWithMask<0x20>(b0,b1,a,r);
     inverseIterateWithMask<0x10>(b0,b1,a,r);
     inverseIterateWithMask<0x08>(b0,b1,a,r);
     inverseIterateWithMask<0x04>(b0,b1,a,r);
     inverseIterateWithMask<0x02>(b0,b1,a,r);

     //Last iteration:
     if (b0 & 0x01)
         r += a;
     if (b1 & 0x01)
         r+=(uint16_t)256*a0;

     return r;
 }

这篇关于更快的16位乘法算法的8位MCU的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆