How to make my Haskell program faster? Comparison with C


Question



    I'm working on an implementation of one of the SHA3 candidates, JH. I'm at the point where the algorithm passes all the KATs (Known Answer Tests) provided by NIST, and I have also made it an instance of the Crypto-API. Thus I have begun looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.

    At the moment my code is consistently slower than the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).

    My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.

    Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:

    Tue Oct 25 19:01 2011 Time and Allocation Profiling Report  (Final)
    
       main +RTS -sstderr -p -hc -RTS jh e False
    
    total time  =        6.56 secs   (328 ticks @ 20 ms)
    total alloc = 4,086,951,472 bytes  (excludes profiling overheads)
    
    COST CENTRE                    MODULE               %time %alloc
    
    roundFunction                  Data.Digest.JHInternal  28.4   37.4
    word128Shift                   Data.BigWord.Word128  14.9   19.7
    blockMap                       Data.Digest.JHInternal  11.9   12.9
    getBytes                       Data.Serialize.Get     6.7    2.4
    unGet                          Data.Serialize.Get     5.5    1.3
    sbox                           Data.Digest.JHInternal   4.0    7.4
    getWord64be                    Data.Serialize.Get     3.7    1.6
    e8                             Data.Digest.JHInternal   3.7    0.0
    swap4                          Data.Digest.JHInternal   3.0    0.7
    swap16                         Data.Digest.JHInternal   3.0    0.7
    swap8                          Data.Digest.JHInternal   1.8    0.7
    swap32                         Data.Digest.JHInternal   1.8    0.7
    parseBlock                     Data.Digest.JHInternal   1.8    1.2
    swap2                          Data.Digest.JHInternal   1.5    0.7
    swap1                          Data.Digest.JHInternal   1.5    0.7
    linearTransform                Data.Digest.JHInternal   1.5    8.6
    shiftl_w64                     Data.Serialize.Get     1.2    1.1
    
    Detailed breakdown omitted ...
    

    Now quickly about the JH algorithm:

    It's a hash algorithm consisting of a compression function F8, which is repeated for as long as there are input blocks (of length 512 bits) left. This is just how the SHA functions operate. The F8 function consists of the E8 function, which applies a round function 42 times. The round function itself consists of three parts: an S-box, a linear transformation, and a permutation (called swap in my code).
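For readers unfamiliar with that layout, the control flow just described can be sketched roughly like this. Note that the `State`/`Block` types and the round bodies below are placeholders standing in for the real 1024-bit state and JH operations, not the question's actual code:

```haskell
import Data.Bits (xor)
import Data.Word (Word64)

type State = [Word64]  -- placeholder for the 1024-bit hash state
type Block = [Word64]  -- placeholder for a 512-bit message block

-- One round: an S-box layer, a linear transform, and a permutation (swap).
-- The three bodies here are stand-ins, not the real JH round.
roundFn :: Int -> State -> State
roundFn r = swap . linear . sbox
  where
    sbox   = map (xor (fromIntegral r))  -- stand-in for the real S-box layer
    linear = map (* 3)                   -- stand-in for the linear transform
    swap   = reverse                     -- stand-in for the permutation

-- E8 applies the round function 42 times.
e8 :: State -> State
e8 st = foldl (flip roundFn) st [0 .. 41]

-- F8 xors the message block into the state and runs E8.
f8 :: State -> Block -> State
f8 st blk = e8 (zipWith xor st blk)

-- The hash repeats F8 for as long as input blocks remain.
hashBlocks :: State -> [Block] -> State
hashBlocks = foldl f8
```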

    Thus it's reasonable that most of the time is spent in the round function. Still I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements in a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions: are there structural changes you would make in order to improve the performance?

    I have tried looking at the Core output, but unfortunately that's way over my head.

    I attach some of the heap profiles at the end as well in case that could be of interest.

    EDIT :

    I forgot to mention my setup and build. I run it on an x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with the compile options:

    ghc --make -O2 -funbox-strict-fields

    Unfortunately there seems to be a bug on the Linux platform when compiling via C or LLVM, giving me the error:

    Error: .size expression for XXXX does not evaluate to a constant

    so I have not been able to see the effect of that.

    Solution

    • Switch to unboxed Vectors (from Array, used for constants)
    • Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
    • Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
    • Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
    • Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significant in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants, as these values will be explicit in the code, resulting in a more cache-friendly binary.
    • Unpack your Word128 values.
    • Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
    • Use rem, not mod
    • Compile with optimization (-O2) and try llvm (-fllvm)
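To make the Word128 points concrete, here is a hedged sketch of an unpacked Word128 with addition defined directly on the two Word64 halves (no round-trip through Integer), plus a right shift built from the unchecked shift primitives. The names (Word128, addW128, shiftR128) are invented for the example and are not the question's actual API:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits (unsafeShiftL, unsafeShiftR, (.|.))
import Data.Word (Word64)

-- High and low halves, unpacked into the constructor.
data Word128 = Word128 {-# UNPACK #-} !Word64   -- high 64 bits
                       {-# UNPACK #-} !Word64   -- low 64 bits
  deriving (Eq, Show)

-- Add the low halves, detect the carry by checking for wrap-around,
-- and propagate it into the high half -- no Integer involved.
addW128 :: Word128 -> Word128 -> Word128
addW128 (Word128 ah al) (Word128 bh bl) =
  let !lo    = al + bl
      !carry = if lo < al then 1 else 0
  in Word128 (ah + bh + carry) lo

-- Logical right shift by n bits, valid only for 0 < n < 64:
-- unsafeShift{R,L} skip the shift-amount check, so the caller must
-- guarantee the range.
shiftR128 :: Word128 -> Int -> Word128
shiftR128 (Word128 hi lo) n =
  Word128 (hi `unsafeShiftR` n)
          ((lo `unsafeShiftR` n) .|. (hi `unsafeShiftL` (64 - n)))
```

On GHC 7.0 (the question's setup) unsafeShiftR/unsafeShiftL are not yet available, as noted above; until 7.4, shiftR/shiftL with a known-in-range argument is the closest substitute.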

    EDIT: And cabalize your git repo along with a benchmark so we can help you easier ;-). Good work on including a crypto-api instance.
