Haskell中的高效比特流 [英] Efficient bitstreams in Haskell

查看:156
本文介绍了Haskell中的高效比特流的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

正在进行有效地摆弄位的尝试(例如,请参见 SO问题)的最新挑战是有效的流传输和比特消耗.

In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.

作为第一个简单的任务,我选择在/dev/urandom生成的比特流中找到最长的相同比特序列.典型的咒语是head -c 1000000 </dev/urandom | my-exe.实际目标是流式传输并解码 Elias伽玛代码,例如,不是字节块或其倍数.

As a first simple task I choose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.

对于此类长度可变的代码,最好使用taketakeWhilegroup等语言进行列表操作.由于BitStream.take实际上会消耗双流的一部分,因此可能会有一些monad起作用.

For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bistream some monad would probably come into play.

明显的起点是

The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.

A.计数字节

这是一个非常简单的Haskell程序,与预期的C语言程序一样,可以执行.

This very simple Haskell program performs on par with a C program, as is to be expected.

import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ BSL.length bs

B.添加字节

一旦我开始使用unpack,情况就会变得更糟.

Once I start using unpack things should get worse.

main = do
    bs <- BSL.getContents
    print $ sum $ BSL.unpack bs

令人惊讶的是,Haskell和C表现出几乎相同的表现.

Suprisingly, Haskell and C show the almost same performance.

C.相同位的最长序列

作为第一个重要任务,可以找到如下所示的相同比特的最长序列:

As a first nontrivial task the longest sequence of identical bits can be found like this:

module Main where

import           Data.Bits            (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import           Data.List            (group)
import           Data.Word8           (Word8)

splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]

bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ maximum $ length <$> (group $ bitStream bs)

惰性字节串被转换为列表[Word8],然后通过移位将每个Word拆分为多个位,从而生成列表[Bool].然后用concat展平此列表列表.获得Bool的(懒惰)列表后,请使用group将列表拆分为相同位的序列,然后在其上映射length.最后,maximum给出所需的结果.很简单,但不是很快:

The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word is split into the bits, resulting in a list [Bool]. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:

# C
real    0m0.606s

# Haskell
real    0m6.062s

这种天真的实现方式要慢一个数量级.

This naive implementation is exactly one order of magnitude slower.

分析表明分配了很多内存(解析1MB输入大约需要3GB).但是,没有观察到大量的空间泄漏.

Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.

我从这里开始四处张望:

From here I start poking around:

  • 有一个 bitstream承诺"具有半自动流融合的快速,打包,严格的位流(即Bools列表).".不幸的是,它不是最新的vector软件包,请参见此处以获得详细信息.
  • 接下来,我调查 streaming .我不太清楚为什么我需要有效"的流传输来使一些monad发挥作用-至少直到我开始执行所提出的任务的相反操作(即,将比特流编码并写入文件)为止.
  • ByteString上只是fold怎么样?我必须引入状态来跟踪消耗的位.并不是很理想的taketakeWhilegroup等语言.
  • There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
  • Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
  • How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.

现在我不确定要去哪里.

And now I'm not quite sure where to go.

更新:

我想出了如何使用 streaming streaming-bytestring .我可能没有正确执行此操作,因为结果是灾难性的.

I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.

import           Data.Bits                 (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import           Data.Word8                (Word8)
import qualified Streaming                 as S
import           Streaming.Prelude         (Of, Stream)
import qualified Streaming.Prelude         as S

splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]

bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s

main :: IO ()
main = do
    let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
        gs = S.group $ bitStream bs ::  Stream (Stream (Of Bool) IO) IO ()
    maxLen <- S.maximum $ S.mapped S.length gs
    print $ S.fst' maxLen

这将测试您从stdin输入的几千字节以外的内容的耐心.探查器说,它在Streaming.Internal.>>=.loopData.Functor.Of.fmap中花费了大量时间(输入大小为平方).我不太确定第一个是什么,但是fmap表示(?)这些Of a b的玩法对我们没有任何好处,因为我们身处IO monad中,所以不可能优化.

This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.

我也有相当于字节加法器的流式传输这里:SumBytesStream.hs ,它比简单的懒惰ByteString实现要慢一些,但仍然不错.由于streaming-bytestring宣称是"字节串正确完成我希望更好.那我可能做错了.

I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.

无论如何,所有这些位计算都不应在IO monad中发生.但是BSS.getContents迫使我进入IO monad,因为getContents :: MonadIO m => ByteString m ()并没有出路.

In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.

更新2

按照@dfeuer的建议,我在master @ HEAD上使用了 streaming 包.这是结果.

Following the advice of @dfeuer I used the streaming package at master@HEAD. Here's the result.

longest-seq-c       0m0.747s    (C)
longest-seq         0m8.190s    (Haskell ByteString)
longest-seq-stream  0m13.946s   (Haskell streaming-bytestring)

Streaming.concat的O(n ^ 2)问题已解决,但我们仍未接近C基准.

The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.

更新3

Cirdec的解决方案产生的性能与C相当.所使用的构造称为教会编码列表",请参见此排名为N的类型.

Cirdec's solution produces a performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.

源文件:

所有源文件都可以在 github 上找到. Makefile具有运行实验和配置文件的所有各种目标.默认的make将只构建所有内容(首先创建一个bin/目录!),然后make time将对longest-seq可执行文件进行计时. C可执行文件附加了-c来区分它们.

All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.

推荐答案

当对流的操作融合在一起时,可以除去中间分配及其相应的开销. GHC前奏以重写规则.一般的想法是,如果一个函数产生的结果看起来像一个文件夹(它对(:)[]应用了类型(a -> b -> b) -> b -> b),而另一个函数消耗了一个看起来像文件夹的列表,则构造中间列表可以被删除.

Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.

对于您的问题,我将构建类似的内容,但使用严格的左折(foldl')而不是文件夹.我将使用一种强制将列表看起来像左折的数据类型,而不是使用试图检测何时看起来像foldl的重写规则.

For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.

-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}

由于我从放弃列表开始,所以我们将重新实现列表前奏的一部分.

Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.

可以从列表和字节串的foldl'函数创建严格的左折.

Strict left folds can be created from the foldl' functions of both lists and bytestrings.

{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)

{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)

使用一个的最简单的例子是查找列表的长度.

The simplest example of using one is to find the length of a list.

{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0

我们还可以映射并连接左折.

We can also map and concatenate left folds.

{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)

{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)

对于您的问题,我们需要能够将一个单词拆分为多个位.

For your problem we need to be able to split a word into bits.

{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]

{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte

然后将ByteString转换为位

{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS

要找到最长的运行,我们将跟踪先前的值,当前运行的长度以及最长的运行的长度.我们将字段设置为严格,以便折叠的严格性可以防止乱码链条在内存中积累.为状态创建严格的数据类型是一种容易控制其内存表示形式和评估其字段的方式.

To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.

data LongestRun = LongestRun !Bool !Int !Int

{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
  where
    current = if x == previous then run + 1 else 1

{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
 where
   (LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)

我们完成了

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ longestRun $ bitStream' bs

这要快得多,但不及c的性能.

This is much faster, but not quite the performance of c.

longest-seq-c       0m00.12s    (C)
longest-seq         0m08.65s    (Haskell ByteString)
longest-seq-fuse    0m00.81s    (Haskell ByteString fused)

程序将分配大约1 Mb的空间来从输入中读取1000000字节.

The program allocates about 1 Mb to read 1000000 bytes from input.

total alloc =   1,173,104 bytes  (excludes profiling overheads)

更新了 github代码

这篇关于Haskell中的高效比特流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆