Efficient bit-fiddling in an LFSR implementation


Question



    Although I have a good LFSR implementation in C, I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which raises the question: how can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.

    Here's the baseline Haskell code using lists and Data.Bits:

    import           Control.Monad      (when)
    import           Data.Bits          (Bits, shift, testBit, xor, (.&.), (.|.))
    import           System.Environment (getArgs)
    import           System.Exit        (exitFailure, exitSuccess)
    
    tap :: [[Int]]
    tap = [
        [],            [],            [],            [3, 2],
        [4, 3],        [5, 3],        [6, 5],        [7, 6],
        [8, 6, 5, 4],  [9, 5],        [10, 7],       [11, 9],
        [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
        [16,15,13,4],  [17, 14],      [18, 11],      [19, 6, 2, 1],
        [20, 17],      [21, 19],      [22, 21],      [23, 18],
        [24,23,22,17], [25, 22],      [26, 6, 2, 1], [27, 5, 2, 1],
        [28, 25],      [29, 27],      [30, 6, 4, 1], [31, 28],
        [32,22,2,1],   [33,20],       [34,27,2,1],   [35,33],
        [36,25],       [37,5,4,3,2,1],[38,6,5,1],    [39,35],
        [40,38,21,19], [41,38],       [42,41,20,19], [43,42,38,37],
        [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
        [48,47,21,20], [49,40],       [50,49,24,23], [51,50,36,35],
        [52,49],       [53,52,38,37], [54,53,18,17], [55,31],
        [56,55,35,34], [57,50],       [58,39],       [59,58,38,37],
        [60,59],       [61,60,46,45], [62,61,6,5],   [63,62]        ]
    
    -- XOR together a list of bits (used to compute the feedback bit).
    xor' :: [Bool] -> Bool
    xor' = foldr xor False
    
    -- Bit mask with the lowest len bits set.
    mask ::  (Num a, Bits a) => Int -> a
    mask len = shift 1 len - 1
    
    -- One LFSR step: shift left, truncate to len bits, and set bit 0 to the
    -- complement of the XOR of the tapped bits.
    advance :: Int -> [Int] -> Int -> Int
    advance len tap lfsr
        | d0        = shifted
        | otherwise = shifted .|. 1
        where
            shifted = shift lfsr 1 .&. mask len
            d0 = xor' $ map (testBit lfsr) tap'
            tap' = map (subtract 1) tap
    
    main :: IO ()
    main = do
        args <- getArgs
        when (null args) $ fail "Usage: lsfr <number-of-bits>"
        let len = read $ head args
        when (len < 8) $ fail "No need for LFSR"
        let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
        if out == 0 then do
            putStr "OK\n"
            exitSuccess
        else do
            putStr "FAIL\n"
            exitFailure
    

    Basically it tests whether the LFSR defined in tap :: [[Int]] for any given bit-length is of maximum length. (More precisely, it just checks whether the LFSR reaches the initial state (zero) after 2^n iterations.)

    According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.

    What I've tried so far:

    • use Data.Array: Attempt abandoned because there's no foldl/r
    • use Data.Vector: Slightly faster than the baseline

    The compiler options I use are: -O2, LTS Haskell 8.12 (ghc-8.0.2).

    The reference C++ program can be found on gist.github.com.

    The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much; there must be a better way to do the bit-fiddling.

    Update: Results of applying the optimisations suggested in the answers

    • The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
    • The baseline Haskell code runs about 100x slower (because of the space inefficiency, don't try it with inputs larger than 25)
    • With the rewrite of @thomas-m-dubuisson, still using the default GHC backend, the execution time goes down to 5.2s
    • With the rewrite of @thomas-m-dubuisson, now using the LLVM backend (ghc option -O2 -fllvm), the execution time goes down to 1.7s
      • Using ghc option -O2 -fllvm -optlc-mcpu=native brings this to 0.73s
    • Replacing iterate with iterate' of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used (a sketch of such a strict iterate' appears below).

    So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
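    The strict iterate' from @cirdec's answer is not reproduced here. Below is a minimal sketch of what such a function might look like, assuming it simply forces each new state before consing it onto the result (the actual answer may differ; since GHC 8.4 an equivalent iterate' ships in Data.List):

    -- Hypothetical strict variant of iterate: force each state to WHNF before
    -- consing it, so the spine of the list doesn't accumulate thunks. For an
    -- Int state, WHNF is full evaluation.
    iterate' :: (a -> a) -> a -> [a]
    iterate' f x = x : (iterate' f $! f x)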

    Note: The LLVM backend for GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing that version with brew and then symlinking opt and llc. See 7.10. GHC Backends.

    Solution

    Up Front Matters

    For starters, I'm using GHC 8.0.1 on an Intel i5 at ~2.5 GHz, Linux x86-64.

    First Draft: Oh No! The slows!

    Your starting code with parameter 25 runs:

    % ghc -O2 orig.hs && time ./orig 25
    [1 of 1] Compiling Main             ( orig.hs, orig.o )
    Linking orig ...
    OK
    ./orig 25  7.25s user 0.50s system 99% cpu 7.748 total
    

    So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Let's dive in.

    Issue 1: Shifty Code

    I found a couple of oddities in the code. The first was the use of shift in high-performance code. shift supports both left and right shifts, and to do so it requires a branch. Let's kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):

    % ghc -O2 noShift.hs && time ./noShift 25
    [1 of 1] Compiling Main             ( noShift.hs, noShift.o )
    Linking noShift ...
    OK
    ./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total
    

    (As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code was generated.)
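
    The rewritten noShift.hs is not shown in the answer; the following is a minimal sketch of the two affected definitions after those substitutions, intended as drop-in replacements inside the original module (everything else unchanged; the take (shift 1 len) in main would likewise become take (2^len)):

    -- Shift-free versions of the original helpers, applying
    -- shift 1 len ~> 2^len and shift lfsr 1 ~> 2*lfsr.
    mask :: (Num a, Bits a) => Int -> a
    mask len = 2^len - 1
    
    advance :: Int -> [Int] -> Int -> Int
    advance len tap lfsr
        | d0        = shifted
        | otherwise = shifted .|. 1
        where
            shifted = 2*lfsr .&. mask len
            d0      = xor' $ map (testBit lfsr) tap'
            tap'    = map (subtract 1) tap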

    Issue 2: Lists Of Bits? Int operations save the day!

    One change, one order of magnitude. Yay. What else? Well, you have this awkward list of bit locations you're tapping that just seems like it's begging for inefficiency and/or leaning on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lfsr 24 `xor` testBit lfsr 21), but we want a more general fast solution.

    I propose we compute the mask of all the tap locations and then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation, which requires LLVM and probably -optlc-mcpu=native or another instruction-set selection that is non-pessimistic.
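
    As a concrete illustration: for the 25-bit register used in the timings below, the taps are [25, 22], so the combined tap mask is 2^24 + 2^21 = 18874368. A pop count of the register ANDed with that mask counts just bits 24 and 21 of the current state, and the new low bit is 1 exactly when that count is even, which matches the original guard on d0.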

    This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:

    let tp = sum $ map ((2^) . subtract 1) (tap !! len)
        pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
        mask = 2^len - 1
        advance' :: Int -> Int
        advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr 
        out :: Int
        out = last $ take (2^len) $ iterate advance' 0
    

    Our resulting performance is:

    % ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25      
    [1 of 1] Compiling Main             ( so.hs, so.o )
    Linking so ...
    OK
    ./so 25  0.06s user 0.00s system 96% cpu 0.067 total
    

    That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings, but this is often an educational exercise, so I hope you had fun.
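
    For reference, here is one way the pieces above could be assembled into a complete program. This is only a sketch, not Thomas's actual so.hs: it reuses the question's tap table and command-line handling verbatim, and is meant to be compiled with -O2 -fllvm -optlc-mcpu=native so that popCount maps to a single instruction.

    import           Control.Monad      (when)
    import           Data.Bits          (popCount, (.&.), (.|.))
    import           System.Environment (getArgs)
    import           System.Exit        (exitFailure, exitSuccess)
    
    -- Tap positions (1-based), exactly as in the question.
    tap :: [[Int]]
    tap = [
        [],            [],            [],            [3, 2],
        [4, 3],        [5, 3],        [6, 5],        [7, 6],
        [8, 6, 5, 4],  [9, 5],        [10, 7],       [11, 9],
        [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
        [16,15,13,4],  [17, 14],      [18, 11],      [19, 6, 2, 1],
        [20, 17],      [21, 19],      [22, 21],      [23, 18],
        [24,23,22,17], [25, 22],      [26, 6, 2, 1], [27, 5, 2, 1],
        [28, 25],      [29, 27],      [30, 6, 4, 1], [31, 28],
        [32,22,2,1],   [33,20],       [34,27,2,1],   [35,33],
        [36,25],       [37,5,4,3,2,1],[38,6,5,1],    [39,35],
        [40,38,21,19], [41,38],       [42,41,20,19], [43,42,38,37],
        [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
        [48,47,21,20], [49,40],       [50,49,24,23], [51,50,36,35],
        [52,49],       [53,52,38,37], [54,53,18,17], [55,31],
        [56,55,35,34], [57,50],       [58,39],       [59,58,38,37],
        [60,59],       [61,60,46,45], [62,61,6,5],   [63,62]        ]
    
    main :: IO ()
    main = do
        args <- getArgs
        when (null args) $ fail "Usage: lsfr <number-of-bits>"
        let len = read $ head args
        when (len < 8) $ fail "No need for LFSR"
        -- The popCount-based step from the answer, folded directly into main.
        let tp   = sum $ map ((2^) . subtract 1) (tap !! len)
            mask = 2^len - 1
            pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
            advance' :: Int -> Int
            advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
            out :: Int
            out = last $ take (2^len) $ iterate advance' 0
        if out == 0
            then putStr "OK\n"   >> exitSuccess
            else putStr "FAIL\n" >> exitFailure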

    Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beaten our mark by a fair bit.
