How do I parse a large data block into memory in Haskell?


Question


On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that

  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend... I could reverse my index lookups if that were the case). I can always write my later algorithms with one or the other in mind.
  • has very little memory overhead
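Of the candidates, an unboxed `Vector` meets the lookup and memory bullets but not the replacement one; a minimal sketch (my own, not from the question) of why:

```haskell
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

main :: IO ()
main = do
  -- a flat unboxed buffer: ~1 byte per Word8 element, O(1) indexing
  let v = V.replicate 10 (0 :: Word8)
  print (v V.! 3)
  -- (V.//) allocates a whole new vector, so even a one-element
  -- update costs O(n); it does NOT give the O(1) replacement wished for
  let v' = v V.// [(3, 255)]
  print (v' V.! 3)
```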

I'm trying to build an image file parser. The file format is your basic 8-bit color ppm file, though I intend to support 16-bit color files and PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files that I work with:

3-10 photographs, the smallest being 45MB and the largest being 110MB.

Now, I can't understand the optimizations put into the Netpbm code, so I decided to have my own try at it. It's a simple file format...

I have started by deciding that no matter what the file format, I'm going to store the final image uncompressed in this format:

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }
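As a quick sanity check of that layout (my own sketch; the row-major `y * width + x` indexing convention is an assumption, not stated in the question):

```haskell
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

-- struct-of-arrays: one flat unboxed vector per color channel
data PixelMap = RGB8
  { width  :: Int
  , height :: Int
  , redChannel, greenChannel, blueChannel :: V.Vector Word8
  }

main :: IO ()
main = do
  -- a 2x1 image; pixel (x, y) lives at index y * width + x in each channel
  let img = RGB8 2 1 (V.fromList [1, 2]) (V.fromList [3, 4]) (V.fromList [5, 6])
  print (redChannel img V.! 1, greenChannel img V.! 1, blueChannel img V.! 1)
```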

Then I wrote a parser that works on three vectors like so:

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

And at the end of the parser, I construct the RGB8 like so:

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

Written like this, the program, loading a single one of these 45MB images, will consume 5GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all !(Vector Word8), then the program remains within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.

On the theory that each call to V.// is actually fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but... I can't even get that to typecheck. The documentation is nonexistent and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know if it will solve the problems, so I'm very reluctant to even try.
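For what it's worth, the mutable-vector API does typecheck without any exotic annotations once the types are pinned down by a signature; a minimal create/write/freeze sketch (mine, not the library documentation's):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import Data.Word (Word8)

-- Build a vector by in-place mutation inside ST, then freeze it.
fill :: Int -> V.Vector Word8
fill n = runST $ do
  mv <- MV.replicate n 0   -- MVector s Word8; no annotation needed here,
                           -- the type flows in from fill's signature
  MV.write mv 0 42         -- O(1) in-place write (bounds-checked)
  V.unsafeFreeze mv        -- O(1), no copy; mv must not be used afterwards

main :: IO ()
main = print (fill 3)
```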

The fundamental problem is actually pretty straightforward:

How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as quickly as possible.
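One general pattern for this (a sketch of mine, under the assumption that the raw input fits in RAM): slurp the input strictly in one go, then build the retained structure directly from it, without threading per-element updates through a parser:

```haskell
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

-- Both ByteString and unboxed Vector are flat byte buffers with O(1)
-- indexing, so the conversion is a single linear pass.
bytesToVector :: B.ByteString -> V.Vector Word8
bytesToVector bs = V.generate (B.length bs) (B.index bs)

main :: IO ()
main = print (bytesToVector (B.pack [10, 20, 30]))
```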

In principle, I want the final representation to be immutable, but I don't much care if I have to use a mutable structure to get there.


Just for completeness, the complete code (BSD3-licensed) is on Bitbucket at https://bitbucket.org/savannidgerinel/photo-tools . The performance branch contains the strict version of the parser, which can be made non-strict with a quick change to the Progress data structure in Codec.Image.Netpbm.

To run the performance test

ulimit -Sv 6000000  # set a soft virtual-memory ulimit of ~6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr

Solution

I first thought that just simply reading the whole chunk of bytestring and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy the entirety of all three vectors on every single byte of the input! Talk about quadratic complexity.
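A back-of-the-envelope sketch of that quadratic cost (the 45 MB figure is from the question; the arithmetic is mine):

```haskell
main :: IO ()
main = do
  let pixels = 15000000 :: Integer   -- ~45 MB file / 3 bytes per pixel
      -- every parsed pixel copies three full pixels-sized Word8 vectors
      bytesCopied = pixels * (3 * pixels)
  print bytesCopied                  -- hundreds of terabytes of copying
```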

So I wrote the following:

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)
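For intuition, `chunksOf3` plus `unzip3` deinterleaves the RGB byte stream into the three channels; a quick check (mine) on a tiny list:

```haskell
-- chunksOf3 as defined in the answer
chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

main :: IO ()
main = print (unzip3 (chunksOf3 [1, 2, 3, 4, 5, 6, 7 :: Int]))
-- note that the trailing 7 is silently dropped: it doesn't complete a triple
```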

And then tested it with a 45 MB file of random bytes. I admit I was surprised that this code resulted in gigabytes of RAM usage. I'm curious as to where exactly the space leak is.

Mutable vectors work well though. The following code uses 133 MB of RAM, and Criterion benchmarks it at 60 ms, file reading included. I included some explanations in the comments. There is also ample material on the ST monad and mutable vectors on SO and elsewhere (I concur, though, that the library documentation is unfriendly to beginners).

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]
