How to do fast data deserialization in Haskell

Question

    Benchmarking shows that the cereal library takes 100x longer to deserialize a data structure of mine (detailed below) than it takes to read the same data off the drive:

    benchmarking Read
    mean: 465.7050 us, lb 460.9873 us, ub 471.0938 us, ci 0.950
    std dev: 25.79706 us, lb 22.19820 us, ub 30.81870 us, ci 0.950
    found 4 outliers among 100 samples (4.0%)
      4 (4.0%) high mild
    variance introduced by outliers: 53.460%
    variance is severely inflated by outliers
    
    benchmarking Read + Decode
    collecting 100 samples, 1 iterations each, in estimated 6.356502 s
    mean: 68.85135 ms, lb 67.65992 ms, ub 70.05832 ms, ci 0.950
    std dev: 6.134430 ms, lb 5.607914 ms, ub 6.755639 ms, ci 0.950
    variance introduced by outliers: 74.863%
    variance is severely inflated by outliers
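
    For reference, a criterion harness along the following lines produces numbers of this shape; the actual benchmark code isn't shown here, so the file path, the WHNF forcing, and the decoded type (the types defined below) are assumptions:

    import Criterion.Main
    import qualified Data.ByteString as B
    import Data.IntMap (IntMap)
    import Data.Serialize (decode)

    main :: IO ()
    main = defaultMain
        [ bench "Read"          $ whnfIO (B.readFile "subindex.bin")
        , bench "Read + Decode" $ whnfIO $ do
            bytes <- B.readFile "subindex.bin"
            -- forces only the Either to WHNF; a thorough benchmark would
            -- deep-force the decoded IntMap as well
            either fail return (decode bytes :: Either String (IntMap [Triplet Atom]))
        ]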
    

    This is also supported by profiling typical deserialization usage of this data structure in a program of mine where 98% of the time is spent deserializing the data and 1% is IO plus the core algorithm:

    COST CENTRE                    MODULE               %time %alloc
    
    getWord8                       Data.Serialize.Get    30.5   40.4
    unGet                          Data.Serialize.Get    29.5   17.9
    getWord64be                    Data.Serialize.Get    14.0   10.7
    getListOf                      Data.Serialize.Get    10.2   12.8
    roll                           Data.Serialize         8.2   11.5
    shiftl_w64                     Data.Serialize.Get     3.4    2.9
    decode                         Data.Serialize         2.9    3.1
    main                           Main                   1.3    0.6
    

    The data structure I'm deserializing is an IntMap [Triplet Atom] and the definitions of the component types are given below:

    type Triplet a = (a, a, a)
    
    data Point = Point {
        _x :: {-# UNPACK #-} !Double ,
        _y :: {-# UNPACK #-} !Double ,
        _z :: {-# UNPACK #-} !Double }
    
    data Atom = Atom {
        _serial :: {-# UNPACK #-} !Int    ,
        _r      :: {-# UNPACK #-} !Point  ,
        _n      :: {-# UNPACK #-} !Word64 }
    

    I'm using the default IntMap, (,,) and [] instances provided by cereal, and the following instances for my custom types:

    instance Serialize Point where
        put (Point x y z) = do
            put x
            put y
            put z
        get = Point <$> get <*> get <*> get
    
    instance Serialize Atom where
        put (Atom s r n) = do
            put s
            put r
            put n
        get = Atom <$> get <*> get <*> get
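
    These snippets assume a module header roughly like the following; it isn't shown in the question, so the import list is an inferred sketch:

    import Control.Applicative ((<$>), (<*>))  -- needed for the applicative get definitions on older GHCs
    import Data.IntMap (IntMap)
    import Data.Serialize (Serialize(..))
    import Data.Word (Word64)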
    

    So my questions are:

    1. Why is deserialization so slow in general?
    2. Is there any way to change my data structure (i.e. IntMap/[]) to make the deserialization go faster?
    3. Is there any way to change my data types (i.e. Atom/Point) to make deserialization go faster?
    4. Are there faster alternatives to cereal within Haskell, or should I store the data structure in C-land to do more rapid deserialization (i.e. use mmap)?

    The files I'm deserializing are used as sub-indices for a search engine: the full index cannot fit in memory on the target computer (a consumer-grade desktop), so I store each sub-index on disk and read+decode the sub-indices pointed to by the initial global index residing in memory. Also, I'm not concerned about serialization speed, since searching the index is the bottleneck for the end user and cereal's current serialization performance is satisfactory for generating and updating the index.
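
    A minimal sketch of that read + decode step, assuming cereal's strict-ByteString decode; the function name and path handling are placeholders, not code from the question:

    import qualified Data.ByteString as B
    import Data.IntMap (IntMap)
    import Data.Serialize (decode)

    -- read one sub-index file strictly and decode it with cereal
    loadSubIndex :: FilePath -> IO (IntMap [Triplet Atom])
    loadSubIndex path = do
        bytes <- B.readFile path            -- strict read: the whole file is in memory
        either fail return (decode bytes)   -- decode :: Serialize a => ByteString -> Either String a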

    Edit:

    Tried out Don's suggestion of using a space-efficient triplet, and this quadrupled the speed:

    benchmarking Read
    mean: 468.9671 us, lb 464.2564 us, ub 473.8867 us, ci 0.950
    std dev: 24.67863 us, lb 21.71392 us, ub 28.39479 us, ci 0.950
    found 2 outliers among 100 samples (2.0%)
      2 (2.0%) high mild
    variance introduced by outliers: 50.474%
    variance is severely inflated by outliers
    
    benchmarking Read + Decode
    mean: 15.04670 ms, lb 14.99097 ms, ub 15.10520 ms, ci 0.950
    std dev: 292.7815 us, lb 278.8742 us, ub 308.1960 us, ci 0.950
    variance introduced by outliers: 12.303%
    variance is moderately inflated by outliers
    

    However, it remains the bottleneck, using 25x as much time as the IO. Also, can anybody explain why Don's suggestion works? Does this mean that switching to something other than a list (an array, say) might give an improvement, too?
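
    For reference, a space-efficient triplet in the spirit of that suggestion presumably looks something like the following (the exact definition used isn't reproduced here): a monomorphic, strict constructor replaces the polymorphic (a, a, a), so the fields can be unpacked and the Serialize instance no longer goes through the generic tuple machinery.

    -- a monomorphic, strict triplet in place of the polymorphic (Atom, Atom, Atom)
    data AtomTriplet = AtomTriplet
        {-# UNPACK #-} !Atom
        {-# UNPACK #-} !Atom
        {-# UNPACK #-} !Atom

    instance Serialize AtomTriplet where
        put (AtomTriplet a b c) = put a >> put b >> put c
        get = AtomTriplet <$> get <*> get <*> get

    -- the sub-index then holds IntMap [AtomTriplet] instead of IntMap [Triplet Atom]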

    Edit #2: Just switched to the latest Haskell Platform and re-ran profiling for cereal. The information is considerably more detailed and I've provided an hpaste of it.

    Solution

    OK, to answer this with a summary of the advice. For fast deserialization of data:

    • Use cereal (strict bytestring output) or binary (lazy bytestring output)
    • Make sure you're compiling with -O2, as these libraries rely on inlining to remove overhead
    • Use dense data types, such as replacing a polymorphic tuple with an unpacked, specialized form.
    • Avoid converting data types to lists to serialize them. If you have bytestrings, this is taken care of. For unpacked array types you will usually get very fast IO, but it's worth double-checking the instances.
    • You may be able to use mmap'd IO
    • For double-heavy data, consider a more efficient Double reader (see the sketch after this list).
    • Use modern array and container types tuned for performance, with more recent GHC versions.
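
    As a sketch of the Double-reader point above (an assumption about what it means, not code from this answer): instead of a Double instance that rebuilds the value from an Integer (the roll / getWord8 cost centres in the profile), read the raw IEEE-754 bits and reinterpret them. Newer cereal versions also provide Data.Serialize.IEEE754.getFloat64be for this.

    import Data.Serialize.Get (Get, getWord64be)
    import GHC.Float (castWord64ToDouble)   -- base >= 4.10 (GHC 8.2)

    -- read a big-endian Double as its raw 64-bit IEEE-754 pattern
    getDoubleBE :: Get Double
    getDoubleBE = castWord64ToDouble <$> getWord64be

    -- and plug it into the Point instance:
    --     get = Point <$> getDoubleBE <*> getDoubleBE <*> getDoubleBE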
