在LFSR实现中有效的位摆动 [英] Efficient bit-fiddling in a LFSR implementation
问题描述
以下是使用列表和 Data.Bits
的基线Haskell代码:
$ $ p $
import Control.Monad(when)
import Data.Bits(Bits,shift,testBit,xor,(。& ;.) ,(。|。))
import System.Environment(getArgs)
import System.Exit(exitFailure,exitSuccess)
tap :: [[Int]]
tap = [
[],[],[],[3,2],
[4,3],[5,3],[6,5],[7,6] ,
[8,6,5,4],[9,5],[10,7],[11,9],
[12,6,4,1],[13, 4,3,1],[14,5,3,1],[15,14],
[16,15,13,4],[17,14],[18,11],[ 19,6,2,1],
[20, 17,21,23,28,23,18],
[24,23,22,17],[25,22],[26,6,2,1] ,[27,5,2,1],
[28,25],[29,27],[30,6,4,1],[31,28],
[32, 22,2,1],[33,20],[34,27,2,1],[35,33],
[36,25],[37,5,4,3,2, 1],[38,6,5,1],[39,35],
[40,38,21,19],[41,38],[42,41,20,19],[ 43,42,38,37],
[44,43,18,17],[45,44,42,41],[46,45,26,25],[47,42],
[48,47,21,20],[49,40],[50,49,24,23],[51,50,36,35],
[52,49],[ 53,52,38,37],[54,53,18,17],[55,31],
[56,55,35,34],[57,50],[58,39] ,[59,58,38,37],
[60,59],[61,60,46,45],[62,61,6,5],[63,62]]
xor':: [Bool] - > Bool
xor'= foldr xor False
mask ::(Num a,Bits a)=> Int - > a
掩码len = shift 1 len - 1
advance :: Int - > [Int] - > Int - > Int
advance len tap lfsr
| d0 =移位
|否则=移动。|。 1
其中
转移= shift lfsr 1。& ;.掩码len
d0 = xor'$ map(testBit lfsr)点击'
tap'=地图(减1)点击
main :: IO()
main = do
args< - getArgs
when(null args)$ fail用法:lsfr<比特数>
let len = read $ head args
when(len <8)$ fail不需要LFSR
let out = last $ take(shift 1 len)$ iterate(advance len (tap !! len))0
if out == 0 then do
putStrOK\\\
exitSuccess
else do
putStrFAIL\ n
exitFailure
基本上它会测试在 tap :: [[Int]]
中定义的LSFR 对于任何给定的位长度是最大长度。 (更确切地说,它只是检查LSFR是否在2 n 迭代后达到初始状态(零)。)根据探查器,最昂贵的线是反馈位 d0 = xor'$ map(testBit lfsr)点击'
。
至今已尝试过:
Data.Array
:尝试被放弃,因为有没有foldl / r
Data.Vector
:略快于基线
我使用的编译器选项是: -O2
, LTS Haskell 8.12(ghc-8.0.2)
。
参考C ++程序可以在 gist.github.com 。
Haskell代码不能像C那样快代码,但两个数量级太多了,必须有更好的方法来做这个小小的摆弄。
更新:在答案中应用优化的结果
- 输入28的参考C ++程序使用LLVM 8.0.0编译,在我的机器上以0.67s运行(与clang 3.7相同,速度稍慢,为0.68s)
基准Haskell代码运行速度大约慢100倍(因为空间效率不高,请不要尝试它与输入大于25) - 重写@ thomas-m-dubuisson,仍然使用默认的GHC后端,执行时间下降到5.2s
- 重写@ thomas-m-dubuisson,现在使用LLVM后端(ghc选项
-O2 -fllvm
),执行时间降至1.7 s
- 使用ghc选项
-O2 -fllvm -optlc -mcpu = native
带来了0.73s
- 使用ghc选项
- 用
iterate替换
iterate
$ @cirdec的c $ c>在托马斯的代码被使用时(使用默认的'本地'后端和LLVM)没有什么区别。但是,当使用基线代码时, 是有区别的。 从100x到8x到1.09x,即只比C慢9%!
$ p $ b LLVM后端到Ghc 8.0。 2需要LLVM 3.7。在Mac OS X上,这意味着使用brew
然后符号链接opt
和llc
。请参阅 7.10。 GHC Backends 。解决方案
对于初学者,我在Intel I5〜2.5GHz,linux x86-64上使用GHC 8.0.1。
First草稿:哦,不!
$ b 您的参数25的起始代码运行:
%ghc -O2 orig.hs&&时间./orig 25
[1的1]编译Main(orig.hs,orig.o)
链接原点...
OK
./orig 25 7.25s user 0.50s系统99%cpu 7.748总计
所以击败的时间是77ms - 两个数量级更好比这个Haskell代码。让我们潜水吧。
问题1:Shifty Code
与代码的怪异之处。首先是在高性能代码中使用
shift
。 Shift支持左右偏移,并且这样做需要分支。让我们用两个可读性更强的方式来杀死它(例如shift 1 x
〜>2 ^ x
和<$ c
pre $ x $ c $ x $ $ $ $ $ $ $ $ >%ghc -O2 noShift.hs&& time ./noShift 25
[1的1]编译Main(noShift.hs,noShift.o)
链接noShift ...
OK
./noShift 25 0.64s用户0.00s system 99%cpu 0.637 total
(正如您在评论中指出的那样:是的, 。可能有些古怪的代码阻止了重写规则被触发,结果导致了更糟糕的代码)
问题2:位列表? Int操作节省了一天的时间!
一个变化,一个数量级。好极了。还有什么?那么你有这个尴尬的位置位置列表,你似乎在寻求低效率和/或依靠脆弱的优化。在这一点上我会注意到,对该列表中的任何一个选择进行硬编码会得到非常好的性能(例如
testBit lsfr 24`xor` testBit lsfr 21
),但是我们希望有一个更通用的快速解决方案。
我建议我们计算所有的抽头位置的掩码,然后执行单指令弹出计数。要做到这一点,我们只需要传入
advance
而不是整个列表中的单个Int
。 popcount指令需要良好的程序集生成,它需要llvm,可能-optlc-mcpu = native
或另一个非悲观的指令集选择。
下面的步骤给我们
pc
。我已经在注释中提到的advance
的后卫折叠中:
let tp = sum $ map((2 ^)。subtract 1)(tap !! len)
pc lfsr = fromEnum(even(popCount(lfsr。& ;. tp)))
mask = 2 ^ len - 1
advance':: Int - > Int
advance'lfsr =(2 * lfsr。& ;. mask)。|。 pc lfsr
out :: Int $ b $ out = last $ take(2 ^ len)$ iterate advance'0
我们的结果表现如下:
%ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu = native&& time ./so 25
[1的1]编译Main(so.hs,so.o)
链接所以...
OK
./so 25 0.06s用户0.00s系统96%cpu 0.067总额
从头到尾有两个数量级,很有希望它与你的C相匹配。最后,在部署的代码中,实际上Haskell包含C绑定是非常常见的,但这通常是一个教育练习,所以我希望你玩得开心。
编辑:现在可用的C ++代码需要我的系统0.10(
g ++ -O3
)和0.12(clang ++ -O3 -march = native
)秒,所以看起来我们已经超过了我们的分数。Although I have a good LSFR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and
Data.Bits
:import Control.Monad (when) import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.)) import System.Environment (getArgs) import System.Exit (exitFailure, exitSuccess) tap :: [[Int]] tap = [ [], [], [], [3, 2], [4, 3], [5, 3], [6, 5], [7, 6], [8, 6, 5, 4], [9, 5], [10, 7], [11, 9], [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14], [16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1], [20, 17], [21, 19], [22, 21], [23, 18], [24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1], [28, 25], [29, 27], [30, 6, 4, 1], [31, 28], [32,22,2,1], [33,20], [34,27,2,1], [35,33], [36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35], [40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37], [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42], [48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35], [52,49], [53,52,38,37], [54,53,18,17], [55,31], [56,55,35,34], [57,50], [58,39], [59,58,38,37], [60,59], [61,60,46,45], [62,61,6,5], [63,62] ] xor' :: [Bool] -> Bool xor' = foldr xor False mask :: (Num a, Bits a) => Int -> a mask len = shift 1 len - 1 advance :: Int -> [Int] -> Int -> Int advance len tap lfsr | d0 = shifted | otherwise = shifted .|. 1 where shifted = shift lfsr 1 .&. mask len d0 = xor' $ map (testBit lfsr) tap' tap' = map (subtract 1) tap main :: IO () main = do args <- getArgs when (null args) $ fail "Usage: lsfr <number-of-bits>" let len = read $ head args when (len < 8) $ fail "No need for LFSR" let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0 if out == 0 then do putStr "OK\n" exitSuccess else do putStr "FAIL\n" exitFailure
Basically it tests whether the LSFR defined in
tap :: [[Int]]
for any given bit-length is of maximum-length. (More precisely, it just checks whether the LSFR reaches the initial state (zero) after 2n iterations.)According to the profiler the most costly line is the feedback bit
d0 = xor' $ map (testBit lfsr) tap'
.What I've tried so far:
- use
Data.Array
: Attempt abandoned because there's no foldl/r - use
Data.Vector
: Slightly faster than the baseline
The compiler options I use are:
-O2
,LTS Haskell 8.12 (ghc-8.0.2)
.The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
- The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
- The baseline Haskell code runs in about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
- With the rewrite of @thomas-m-dubuisson, still using the default GHC backend, the execution time goes down to 5.2s
- With the rewrite of @thomas-m-dubuisson, now using the LLVM backend (ghc option
-O2 -fllvm
), the execution time goes down to 1.7s- Using ghc option
-O2 -fllvm -optlc -mcpu=native
brings this to 0.73s
- Using ghc option
- Replacing
iterate
withiterate'
of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.
So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note The LLVM backend to Ghc 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with
brew
and then symlinkingopt
andllc
. See 7.10. GHC Backends.解决方案Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25 [1 of 1] Compiling Main ( orig.hs, orig.o ) Linking orig ... OK ./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of
shift
in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x
~>2^x
andshift x 1
~>2*x
):% ghc -O2 noShift.hs && time ./noShift 25 [1 of 1] Compiling Main ( noShift.hs, noShift.o ) Linking noShift ... OK ./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as
testBit lsfr 24 `xor` testBit lsfr 21
) but we want a more general fast solution.I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single
Int
passed in toadvance
instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably-optlc-mcpu=native
or another instruction set selection that is non-pessimistic.This step gives us
pc
below. I've folded in the guard-removal ofadvance
that was mentioned in the comments:let tp = sum $ map ((2^) . subtract 1) (tap !! len) pc lfsr = fromEnum (even (popCount (lfsr .&. tp))) mask = 2^len - 1 advance' :: Int -> Int advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr out :: Int out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25 [1 of 1] Compiling Main ( so.hs, so.o ) Linking so ... OK ./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (
g++ -O3
) and 0.12 (clang++ -O3 -march=native
) seconds, so it seems we've beat our mark by a fair bit.这篇关于在LFSR实现中有效的位摆动的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- use