Haskell中的高效比特流 [英] Efficient bitstreams in Haskell
问题描述
正在进行有效地摆弄位的尝试(例如,请参见 SO问题)的最新挑战是有效的流传输和比特消耗.
In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
作为第一个简单的任务,我选择在/dev/urandom
生成的比特流中找到最长的相同比特序列.典型的咒语是head -c 1000000 </dev/urandom | my-exe
.实际目标是流式传输并解码 Elias伽玛代码,例如,不是字节块或其倍数.
As a first simple task I choose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom
. A typical incantation would be head -c 1000000 </dev/urandom | my-exe
. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.
对于此类长度可变的代码,最好使用take
,takeWhile
,group
等语言进行列表操作.由于BitStream.take
实际上会消耗双流的一部分,因此可能会有一些monad起作用.
For such codes of variable length it is nice to have the take
, takeWhile
, group
, etc. language for list manipulation. Since a BitStream.take
would actually consume part of the bistream some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy
.
A.计数字节
这是一个非常简单的Haskell程序,与预期的C语言程序一样,可以执行.
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B.添加字节
一旦我开始使用unpack
,情况就会变得更糟.
Once I start using unpack
things should get worse.
main = do
bs <- BSL.getContents
print $ sum $ BSL.unpack bs
令人惊讶的是,Haskell和C表现出几乎相同的表现.
Suprisingly, Haskell and C show the almost same performance.
C.相同位的最长序列
作为第一个重要任务,可以找到如下所示的相同比特的最长序列:
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
惰性字节串被转换为列表[Word8]
,然后通过移位将每个Word
拆分为多个位,从而生成列表[Bool]
.然后用concat
展平此列表列表.获得Bool
的(懒惰)列表后,请使用group
将列表拆分为相同位的序列,然后在其上映射length
.最后,maximum
给出所需的结果.很简单,但不是很快:
The lazy bytestring is converted to a list [Word8]
and then, using shifts, each Word
is split into the bits, resulting in a list [Bool]
. This list of lists is then flattened with concat
. Having obtained a (lazy) list of Bool
, use group
to split the list into sequences of identical bits and then map length
over it. Finally maximum
gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
这种天真的实现方式要慢一个数量级.
This naive implementation is exactly one order of magnitude slower.
分析表明分配了很多内存(解析1MB输入大约需要3GB).但是,没有观察到大量的空间泄漏.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
我从这里开始四处张望:
From here I start poking around:
- 有一个
bitstream
包承诺"具有半自动流融合的快速,打包,严格的位流(即Bools列表).".不幸的是,它不是最新的vector
软件包,请参见此处以获得详细信息. - 接下来,我调查
streaming
.我不太清楚为什么我需要有效"的流传输来使一些monad发挥作用-至少直到我开始执行所提出的任务的相反操作(即,将比特流编码并写入文件)为止. - 在
ByteString
上只是fold
怎么样?我必须引入状态来跟踪消耗的位.并不是很理想的take
,takeWhile
,group
等语言.
- There is a
bitstream
package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the currentvector
package, see here for details. - Next, I investigate
streaming
. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file. - How about just
fold
-ing over theByteString
? I'd have to introduce state to keep track of consumed bits. That's not quite the nicetake
,takeWhile
,group
, etc. language that is desirable.
现在我不确定要去哪里.
And now I'm not quite sure where to go.
更新:
我想出了如何使用 streaming
和 streaming-bytestring
.我可能没有正确执行此操作,因为结果是灾难性的.
I figured out how to do this with streaming
and streaming-bytestring
. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
这将测试您从stdin输入的几千字节以外的内容的耐心.探查器说,它在Streaming.Internal.>>=.loop
和Data.Functor.Of.fmap
中花费了大量时间(输入大小为平方).我不太确定第一个是什么,但是fmap
表示(?)这些Of a b
的玩法对我们没有任何好处,因为我们身处IO monad中,所以不可能优化.
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop
and Data.Functor.Of.fmap
. I'm not quite sure what the first one is, but the fmap
indicates (?) that the juggling of these Of a b
isn't doing us any good and because we're in the IO monad it can't be optimised away.
我也有相当于字节加法器的流式传输这里:SumBytesStream.hs
,它比简单的懒惰ByteString
实现要慢一些,但仍然不错.由于streaming-bytestring
被宣称是"字节串正确完成我希望更好.那我可能做错了.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs
, which is slightly slower than the simple lazy ByteString
implementation, but still decent. Since streaming-bytestring
is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
无论如何,所有这些位计算都不应在IO monad中发生.但是BSS.getContents
迫使我进入IO monad,因为getContents :: MonadIO m => ByteString m ()
并没有出路.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents
forces me into the IO monad because getContents :: MonadIO m => ByteString m ()
and there's no way out.
更新2
按照@dfeuer的建议,我在master @ HEAD上使用了 streaming
包.这是结果.
Following the advice of @dfeuer I used the streaming
package at master@HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
Streaming.concat
的O(n ^ 2)问题已解决,但我们仍未接近C基准.
The O(n^2) problem of Streaming.concat
is solved, but we're still not getting closer to the C benchmark.
更新3
Cirdec的解决方案产生的性能与C相当.所使用的构造称为教会编码列表",请参见此排名为N的类型.
Cirdec's solution produces a performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.
源文件:
所有源文件都可以在 github 上找到. Makefile
具有运行实验和配置文件的所有各种目标.默认的make
将只构建所有内容(首先创建一个bin/
目录!),然后make time
将对longest-seq
可执行文件进行计时. C可执行文件附加了-c
来区分它们.
All the source files can be found on github. The Makefile
has all the various targets to run the experiments and the profiling. The default make
will just build everything (create a bin/
directory first!) and then make time
will do the timing on the longest-seq
executables. The C executables get a -c
appended to distinguish them.
推荐答案
当对流的操作融合在一起时,可以除去中间分配及其相应的开销. GHC前奏以重写规则.一般的想法是,如果一个函数产生的结果看起来像一个文件夹(它对(:)
和[]
应用了类型(a -> b -> b) -> b -> b
),而另一个函数消耗了一个看起来像文件夹的列表,则构造中间列表可以被删除.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b
applied to (:)
and []
) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
对于您的问题,我将构建类似的内容,但使用严格的左折(foldl'
)而不是文件夹.我将使用一种强制将列表看起来像左折的数据类型,而不是使用试图检测何时看起来像foldl
的重写规则.
For your problem I'm going to build something similar, but using strict left folds (foldl'
) instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl
, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
由于我从放弃列表开始,所以我们将重新实现列表前奏的一部分.
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
可以从列表和字节串的foldl'
函数创建严格的左折.
Strict left folds can be created from the foldl'
functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
使用一个的最简单的例子是查找列表的长度.
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
我们还可以映射并连接左折.
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
对于您的问题,我们需要能够将一个单词拆分为多个位.
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
然后将ByteString
转换为位
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
要找到最长的运行,我们将跟踪先前的值,当前运行的长度以及最长的运行的长度.我们将字段设置为严格,以便折叠的严格性可以防止乱码链条在内存中积累.为状态创建严格的数据类型是一种容易控制其内存表示形式和评估其字段的方式.
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
我们完成了
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
这要快得多,但不及c的性能.
This is much faster, but not quite the performance of c.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
程序将分配大约1 Mb的空间来从输入中读取1000000字节.
The program allocates about 1 Mb to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
更新了 github代码
这篇关于Haskell中的高效比特流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!