tinyAVR: best known multiplication routines for 8-bit and 16-bit factors?
Question
"Faster than avr200b.asm"? The mpy8u routine from avr200b.asm, for those processors of Atmel's AVR family that do not implement any of the MUL instructions, seems pretty generic, but mpy16u looks sloppy for rotating both lower result bytes 16 times instead of 8. Antonio presented a fast 16×16→16 unsigned multiplication using 64 cycles worst case, excluding call/return overhead.
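For a 16×16→16 product, one identity does much of the work: with a = 256·a1 + a0 and b = 256·b1 + b0, a·b ≡ a·b0 + 256·(a0·b1) (mod 2^16), because the a1·b1·65536 term vanishes. So the high multiplier byte only ever needs the 8-bit a0 added into the high product byte. The following is a host-side C sketch (not AVR code; both function names are hypothetical) that checks a plain one-bit-per-step shift-and-add model against that split form:

```c
#include <stdint.h>

/* Plain shift-and-add: low 16 bits of a*b, one multiplier bit per step. */
static uint16_t mul16x16_16_ref(uint16_t a, uint16_t b)
{
    uint16_t p = 0;
    while (b != 0) {
        if (b & 1u)
            p = (uint16_t)(p + a);   /* add the shifted multiplicand */
        a = (uint16_t)(a << 1);      /* weight the multiplicand by 2 */
        b >>= 1;                     /* move on to the next multiplier bit */
    }
    return p;
}

/* Split form: a*b mod 2^16 == a*b0 + 256*(a0*b1) mod 2^16. */
static uint16_t mul16x16_16_split(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)a;
    uint8_t b0 = (uint8_t)b;
    uint8_t b1 = (uint8_t)(b >> 8);
    uint16_t low = (uint16_t)((uint32_t)a * b0);   /* full a times b0 */
    uint8_t high = (uint8_t)(a0 * b1);             /* only 8 bits can matter */
    return (uint16_t)(low + ((uint16_t)high << 8));
}
```

Any of the routines below can be compared against `mul16x16_16_ref` the same way, as long as the AVR code is run in a simulator.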
I arbitrarily suggest worst-case cycle count, word count (RAM and flash), register usage, and expected cycle count as optimisation goals, in order of decreasing priority.
(There are reduced-core AVRs ("single digit" ATtiny, 10/20/40) with differences including timing, which I suggest ignoring.)
(Caution: Don't take any claim herein for granted, at least not without independent affirmation.)
What are the best currently known 8×8→8/16, 16×16→16/32 and 16×8→16/24 bit multiplication routines for AVRs without MUL?
Answer
Implementations bordering on space-conscious (for reference, if not sanity).
Resources used should probably be qualified (g: wild guess, G: guessed, e: educated guess, E: estimated, s: simulated, a: analysed, A: analysed & substantiated, if by simulation, m: measured). Words × worst-case cycle count is a cost measure akin to area × delay in IC design (a single figure of "merit"?).
algorithm           bits       cycles     words   words×wc  regs excl.  remarks
                               wc   exp                     a,b,p
shift factor left   16×16→16   61   56    87      5307                  (see other
                               62   57    62      3844                   answer)
                               73   68    37      2701
                               81   77    24      1944                  (see edit history)
                               85   70g   15      1275                  words×exp ≈ 1050
                               108  64g   18      1944                  words×exp ≈ 1150
(jump table,                   51E  49g   888e    44K G                 (almost done)
 for reference)                44E  39g   2888E   127K e
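The words×wc column is the product of the worst-case cycle count and the word count. A few lines of C recompute it from the table's own figures for the six shift-and-add rows (the jump-table rows are left out, their figures being estimates and guesses); the struct and function names are illustrative only.

```c
#include <stddef.h>

/* One 16x16->16 row of the table: worst-case cycles and words. */
struct cost_row {
    unsigned wc_cycles;
    unsigned words;
};

/* Worst-case cycles and words as listed in the table, top to bottom. */
static const struct cost_row rows[] = {
    {61, 87}, {62, 62}, {73, 37}, {81, 24}, {85, 15}, {108, 18},
};

/* Single figure of "merit", akin to area x delay: words x worst-case cycles. */
static unsigned merit(struct cost_row r)
{
    return r.words * r.wc_cycles;
}

/* Index of the row with the smallest words x wc product. */
static size_t best_row(const struct cost_row *r, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (merit(r[i]) < merit(r[best]))
            best = i;
    return best;
}
```

By this measure the 15-word looped routine (85 cycles, 1275) ranks first, despite having the second-highest worst-case cycle count.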
(I checked the identical "word×cycle" entries more than once.)
Macros (these could conceivably be factored out):
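.MACRO doubleA ; adds (shifts/weights) factor "a": a1:a0 <<= 1
add a0, a0 ; +1
adc a1, a1 ; +2
.EndM
.MACRO addA ; p1:p0 += a1:a0 (assumed definition: 2 words, 2 cycles)
add p0, a0 ; +1
adc p1, a1 ; +2
.EndM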
.MACRO doHighB ; "does" bit in b1, bit number as a parameter
sbrc b1, @0 ; 1
add p1, a0 ; 2
.EndM
.MACRO condAdd ; doHighB @0, then p += a if bit @1 of b0 is set
doHighB @0 ; +2
sbrs b0, @1 ; +3
rjmp PC+3 ;+4/5 skips the 2-word addA
addA ; +6
.EndM
.MACRO step16 ; "do" 2 bits, bit# in b1 and b0 as a parameter
condAdd @0, @0 ; +6
doubleA ; +8
.EndM
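To make the macros' data flow explicit, here is a host-side C model in which a1:a0, b1:b0 and p1:p0 are plain bytes in a struct. It models values only, not flags or cycle counts; `struct regs` and the function names are hypothetical, and `addA` is assumed to be the 16-bit add p += a that the macros rely on.

```c
#include <stdint.h>

/* Hypothetical register file: a = multiplicand, b = multiplier, p = product. */
struct regs {
    uint8_t a0, a1, b0, b1, p0, p1;
};

/* doubleA: add a0,a0 / adc a1,a1 -> a <<= 1 (16 bit) */
static void doubleA(struct regs *r)
{
    uint16_t a = (uint16_t)((r->a1 << 8 | r->a0) << 1);
    r->a0 = (uint8_t)a;
    r->a1 = (uint8_t)(a >> 8);
}

/* addA (assumed): add p0,a0 / adc p1,a1 -> p += a (16 bit) */
static void addA(struct regs *r)
{
    uint16_t p = (uint16_t)((r->p1 << 8 | r->p0) + (r->a1 << 8 | r->a0));
    r->p0 = (uint8_t)p;
    r->p1 = (uint8_t)(p >> 8);
}

/* doHighB n: sbrc b1,n / add p1,a0 -> bit n of b1 adds a0 into p1 */
static void doHighB(struct regs *r, unsigned n)
{
    if (r->b1 & (1u << n))
        r->p1 = (uint8_t)(r->p1 + r->a0);
}

/* condAdd hi, lo: doHighB hi, then p += a if bit lo of b0 is set */
static void condAdd(struct regs *r, unsigned hi, unsigned lo)
{
    doHighB(r, hi);
    if (r->b0 & (1u << lo))
        addA(r);
}

/* step16 n: handle bit n of both b0 and b1, then double a */
static void step16(struct regs *r, unsigned n)
{
    condAdd(r, n, n);
    doubleA(r);
}
```

Running `step16` for n = 0..7 on a zeroed p1:p0 should leave the low 16 bits of a·b in p1:p0, which is what the test below checks.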
16×16→16 bits, 85/81 cycles, 15/24 words:
mpy16x16: ; 0
; Due to handling the 2nd multiplier bit even if the multiplicand is zero
; after the first shift, the earlyOutA variant is 3 cycles slower than
; libgcc __mulhi3 (gcc 4.8) for * 0 or 0x8000.
clr p0 ; 1
clr p1 ; 2
; wanting early out: shifting the factor; faster from Little End
lsr b0 ; 3
brcc shiftB1 ;4/5
addFull:
addA ; 2
shiftB1:
lsr b1 ; 3
brcc pc+2 ;4/5 skip addHigh
addHigh:
add p1, a0 ; 5
shiftA:
doubleA ; 7
#if 1||earlyOutA
brne shiftB0 ;+1/2 7 adc sets Z from a1 alone (unlike sbc/sbci/cpc),
tst a0 ;+ 2 so a0 needs its own test
breq done ;+ 3/-1upto-69?
#endif
shiftB0:
lsr b0 ; 8
brcs addFull ;9/10
sbci b1, 0 ; 10 C=0: Z stays set only if b1:b0 is 0; b1 must be r16..r31 (or use a zero reg?)
brne shiftB1 ;11/12-2
done: ; wc: 8*10+5=85 @15+1 words (?!)
ret ; best: 14 (0=b&0xfffe) (none for a)
;(earlyOutA: wc: 8*13+4=108 @18+1 words)
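The looped routine leans on two flag details from the AVR instruction set: `adc` sets Z from its own result only, while `sbc`, `sbci` and `cpc` leave Z set only if it was already set and the result is zero. That is why `lsr b0` followed by `sbci b1, 0` (with C known to be clear) lets one `brne` test the whole multiplier for zero, whereas the early-out test on `a` needs a separate `tst a0`. The C model below mirrors the control flow label by label; it tracks values only, not cycles, and the function name and the `early_out_a` parameter (standing in for `earlyOutA`) are hypothetical.

```c
#include <stdint.h>

/* Host-side model of the looped mpy16x16 (not AVR code).
   early_out_a stands in for the earlyOutA build switch. */
static uint16_t mpy16x16_loop(uint16_t a, uint16_t b, int early_out_a)
{
    uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);
    uint16_t p = 0;                       /* clr p0 / clr p1 */
    int carry = b0 & 1;                   /* lsr b0: bit 0 goes to C */
    b0 >>= 1;

    for (;;) {
        if (carry)                        /* addFull: p += a */
            p = (uint16_t)(p + a);
        carry = b1 & 1;                   /* shiftB1: lsr b1 */
        b1 >>= 1;
        if (carry)                        /* addHigh: add p1, a0 */
            p = (uint16_t)(p + ((a & 0xFFu) << 8));
        a = (uint16_t)(a << 1);           /* shiftA: doubleA */
        if (early_out_a && a == 0)        /* brne / tst a0 / breq done */
            break;
        carry = b0 & 1;                   /* shiftB0: lsr b0, Z = (b0 == 0) */
        b0 >>= 1;
        if (carry)                        /* brcs addFull */
            continue;
        /* sbci b1, 0 with C = 0: b1 unchanged, Z stays set only if b1 == 0 */
        if (b0 == 0 && b1 == 0)           /* brne shiftB1 not taken: done */
            break;
    }                                     /* else back to shiftB1 with no add */
    return p;
}
```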
16×16→16 bits, 73 cycles, 37 words:
mpy16x16: ; 0
clr p0 ; 1
clr p1 ; 2
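rcall nibble ; 9 incl. ret (>16bit PC AVRs have mul(?)); low nibbles
swap b0 ; 10 bits 4..7 of b0 move to 0..3
swap b1 ; 11 same for b1
doubleA ; 13 a is now weighted by 16
nibble: ; 2nd pass falls through; its ret returns from mpy16x16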
step16 0 ; +8
step16 1 ; +16
step16 2 ; +24
doHighB 3 ; +26
sbrs b0, 3 ;27/28
ret ; yikes
addA ; +30
ret ; 30 Hrrm *2+13 = 73 @ 4*8+5 = 37 words
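The unrolled routine's trick is to call `nibble` once, then swap the nibbles of b0 and b1 and fall into `nibble` a second time, so that its final `ret` returns from `mpy16x16`. A C model can check that this handles all 16 multiplier bits with the right weights. It is a data-flow sketch only (no cycle or flag modelling), and the function names are hypothetical.

```c
#include <stdint.h>

/* swap: exchange the two nibbles of a byte */
static uint8_t swap_nibbles(uint8_t x)
{
    return (uint8_t)((x << 4) | (x >> 4));
}

/* One pass of "nibble": bits 0..3 of b0 and b1. As with step16, a is doubled
   after bits 0..2 only; bit 3 uses doHighB 3 / addA without a doubleA. */
static void nibble(uint16_t *p, uint16_t *a, uint8_t b0, uint8_t b1)
{
    for (unsigned n = 0; n < 4; n++) {
        if (b1 & (1u << n))                          /* doHighB n: p1 += a0 */
            *p = (uint16_t)(*p + ((*a & 0xFFu) << 8));
        if (b0 & (1u << n))                          /* addA: p += a */
            *p = (uint16_t)(*p + *a);
        if (n < 3)                                   /* doubleA (step16 only) */
            *a = (uint16_t)(*a << 1);
    }
}

static uint16_t mpy16x16_nibble(uint16_t a, uint16_t b)
{
    uint16_t p = 0;                        /* clr p0 / clr p1 */
    uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);

    nibble(&p, &a, b0, b1);                /* rcall nibble: a ends as a<<3 */
    b0 = swap_nibbles(b0);                 /* swap b0 */
    b1 = swap_nibbles(b1);                 /* swap b1 */
    a = (uint16_t)(a << 1);                /* doubleA: a is now a<<4 */
    nibble(&p, &a, b0, b1);                /* fall into nibble; its ret ends */
    return p;
}
```

Reusing one copy of `nibble` for both halves is what keeps the word count near half of a fully unrolled version, at the price of the rcall/ret overhead.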