优化Haskell内部循环 [英] Optimizing Haskell Inner Loops

查看:84
本文介绍了优化Haskell内部循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

仍在使用Haskell中的SHA1实现。我已经得到了一个工作实现,这是内部循环:

  iterateBlock':: Int  - > [Word32]  - > Word32  - > Word32  - > Word32  - > Word32  - > Word32  - > [字32] 
iterateBlock'80 ws abcde = [a,b,c,d,e]
iterateBlock't(w:ws)abcde = iterateBlock'(t + 1)ws a'b' c'd'e'
其中
a'=旋转5 + ftbcd + e + w + kt
b'= a
c'=旋转b 30
'= c
e'= d

分析器告诉我这个函数需要我实现的运行时间的1/3。我可以想象没有办法进一步优化它,除了可能内联临时变量,但我相信-O2无论如何会为我做到这一点。



任何人都可以看到可以进一步应用的显着优化吗?



FYI k和f调用低于。他们非常简单,我认为没有办法优化这些。除非Data.Bits模块很慢?

  f :: Int  - > Word32  - > Word32  - > Word32  - > Word32 
f t b c d
| t≤19=(b。& c)。|。 ((补数b)。& d)
| t< = 39 = b`xor` c`xor` d
| t <= 59 =(b。& c)。|。 (b。& d)。|。 (c。& ;. d)
|否则= b`xor` c`xor` d

k :: Int - > Word32
k t
| t <= 19 = 0x5A827999
| t <= 39 = 0x6ED9EBA1
| t <= 59 = 0x8F1BBCDC
|否则= 0xCA62C1D6


解决方案

7.2.2,内联运作良好。没有效果的是,在每次迭代中,首先将一些 Word32 值解开,以执行工作,然后重新进行下一次迭代。拆箱和重新装箱会花费惊人的大量时间(和分配)。
您可以避免使用 Word 而不是 Word32 。你不能在Data.Bits中使用 rotate ,但是你必须自己实现它(不难),以使它在64位系统上也能正常工作。对于 a',您必须手动屏蔽掉高位。



看起来不理想的另一点是,每次迭代 t 与19,39和59(如果足够大)进行比较,以便循环体包含四个分支。如果将 iterateBlock'分成四个循环(0-19,20-39,40-59,60-79)并使用常量k1,则它可能会更快。 ...,k4和四个函数f1,...,f4(没有 t 参数)以避免分支并且每个循环都有较小的代码大小。



而且,正如托马斯所说,使用块数据的列表并不是最优的,一个未装箱的Word数组/矢量也可能有帮助。



随着爆炸模式,核心看起来好多了。

 (GHC.Prim.narrow32Word#
(GHC.Prim .plusWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#$ b $ (GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.or#
(GHC.Prim。 uncheckedShiftL#sc2_sEn 5)
(GHC.Prim.uncheckedShiftRL#sc2_sEn 27)))
y#_aBw))
sc6_sEr))

查看所有这些 narrow32Word#?他们很便宜,但不是免费的。只需要最外面的部分,通过手工编码步骤并使用 Word 可能有点收获。



然后比较 t 和19,...,它们出现两次,一次确定 k 常量,一次用于 f 转换。单单比较便宜,但它们会导致分支,如果没有它们,则可能会进一步内联。我希望在这里也可以获得一点。



还有,这份名单。这意味着 w 不能拆箱,如果 w 不可拆卸,核心可能会更简单。

b $ b

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:

iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e    = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
    where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d

The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.

Can anyone see a significant optimization that can be further applied?

FYI the k and f calls are below. They are so simple I don't think there is a way to optimize these other. Unless the Data.Bits module is slow?

f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
    | t <= 19   = (b .&. c) .|. ((complement b) .&. d)
    | t <= 39   = b `xor` c `xor` d
    | t <= 59   = (b .&. c) .|. (b .&. d) .|. (c .&. d)
    | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
    | t <= 19   = 0x5A827999
    | t <= 39   = 0x6ED9EBA1
    | t <= 59   = 0x8F1BBCDC
    | otherwise = 0xCA62C1D6

解决方案

Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation). You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.

Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.

And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.

With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.

                      (GHC.Prim.narrow32Word#
                         (GHC.Prim.plusWord#
                            (GHC.Prim.narrow32Word#
                               (GHC.Prim.plusWord#
                                  (GHC.Prim.narrow32Word#
                                     (GHC.Prim.plusWord#
                                        (GHC.Prim.narrow32Word#
                                           (GHC.Prim.plusWord#
                                              (GHC.Prim.narrow32Word#
                                                 (GHC.Prim.or#
                                                    (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                                                    (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                                              y#_aBw))
                                        sc6_sEr))
                                  y#1_XCZ))
                            y#2_XD6))

See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.

Then the comparisons of t with 19, ..., they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches and without them, further inlining may be possible. I expect a bit could be gained here too.

And still, the list. That means w can't be unboxed, the core could be simpler if w were unboxable.

这篇关于优化Haskell内部循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆