Haskell:我可以在相同的懒惰列表上执行多次折叠而不在内存中保留列表吗? [英] Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

查看:97
本文介绍了Haskell:我可以在相同的懒惰列表上执行多次折叠而不在内存中保留列表吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的背景是生物信息学,特别是下一代测序,但问题是通用的;所以我会使用一个日志文件作为例子。



这个文件非常大(千兆字节大,压缩,所以不适合内存),但很容易解析(每一行都是一个条目),所以我们可以很容易地写出如下内容:

  parse :: Lazy.ByteString  - > [LogEntry] 

现在,我想从日志文件中计算出很多统计数据。最简单的做法是编写单独的函数,例如:

  totalEntries = length 
nrBots = sum。来自Enum的地图。 map isBotEntry
averageTimeOfDay =直方图。 map extractHour

所有这些都是 foldl'k z形式。 map



问题是,如果我尝试以最自然的方式使用它们,比如

  main = do 
input< - Lazy.readFileinput.txt
let logEntries =解析输入
totalEntries' = totalEntries logEntries
nrBots'= nrBots logEntries
avgTOD = averageTimeOfDay logEntries
print totalEntries'
print nrBots'
print avgTOD

这将在内存中分配整个列表,这不是我想要的。我希望折叠是同步完成的,这样可以收集垃圾单元。如果我只计算一个统计量,这就是发生了什么。



我可以编写一个这样做的大函数,但它不是可组合的代码。 >

或者,我一直在做的是,我分别运行每个传递,但是这会重新加载&每次解压文件。

解决方案

对sdcvvc的评论引用这个'美丽的折叠'散文太酷了 - 很美,正如他所说 - 我忍不住添加了 Functor Applicative 实例以及其他一些现代化功能。例如 x y z 的同时折叠是一个简单的产品:(,,)< $> x * y * ž。我做了一个半字节的小随机整数文件,花费了10秒钟的时间 - 在我生锈的笔记本电脑上计算长度,总和和最大值。它似乎没有得到进一步的注释的帮助,但编译器可以看到 Int 是我所感兴趣的;读取明显的地图。行作为解析器导致了无望的空间和时间灾难,所以我展示了粗略使用 ByteString.readInt ;否则它基本上是一个 Data.List 过程。

  { - # LANGUAGE GADTs,BangPatterns# - } 

import Data.List(foldl',unfoldr)
import Control.Applicative
将限定的Data.ByteString.Lazy.Char8导入为B

main = fmap readInts(B.readFileint.txt)>> = print。折叠allThree
where allThree =(,,)< $> length_ *> sum_ *最大_

数据折叠b c其中F ::(a - > b - > a) - > a - > (a - > c) - >折叠bc
data Pair ab = P!a!b

实例Functor(Fold b)其中fmap f(F op xg)= F op x(f。g)

实例应用(折叠b)其中
pure c = F const()(const c)
(F fxc)* (f a c')= F(梳fg)(p xy)(c *** c')
其中梳fg(P a a')b = P(fab)(g a'b)
(***)fg(P xy)= fx(gy)

fold ::折叠bc - > [b] - > c
fold(F f x c)bs = c $(foldl'f x bs)

sum_,product_ :: Num a =>折叠aa
length_ ::折叠一个Int
sum_ = F(+)0 id
product_ = F(*)1 id
length_ = F(常数(+1) )0 id
maximum_ = F max 0 id
readInts = unfoldr $ \bs - >案例B.readInt bs of
Nothing - > Nothing
Just(n,bs2) - >如果不是(B.null bs2),那么Just(n,B.tail bs2)
else Just(n,B.empty)

编辑:毫不奇怪,因为我们必须处理上面的unboxed类型,以及从例如派生的unboxed向量一个2G文件可以适应内存,这是所有速度的两倍,并且在Data.Vector.Uboxed的 http:/ /hpaste.org/69270 当然,这与 LogEntry 类型无关,但请注意 Fold type和fold'multiplication'概括了没有修订的顺序类型,例如与 Char s或 Word8 s上的操作相关的折叠可以同时直接叠加在ByteString上。首先必须定义一个 foldB ,通过重新编辑 fold 来使用 foldl' s。但是 Fold Fold 的产品与您折叠 Char s或 Word8 s


My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example.

The file is very large (Gigabytes large, compressed, so it will not fit in memory), but is easy to parse (each line is an entry), so we can easily write something like:

parse :: Lazy.ByteString -> [LogEntry]

Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as:

totalEntries = length
nrBots = sum . map fromEnum . map isBotEntry
averageTimeOfDay = histogram . map extractHour

All of these are of the form foldl' k z . map f.

The problem is that if I try to use them in the most natural way, like

main = do
    input <- Lazy.readFile "input.txt"
    let logEntries = parse input
        totalEntries' = totalEntries logEntries
        nrBots' = nrBots logEntries
        avgTOD = averageTimeOfDay logEntries
    print totalEntries'
    print nrBots'
    print avgTOD

This will allocate the whole list in memory, which is not what I want. I want the folds to be done synchronously, so that the cons cells can be garbage collected. If I compute only a single statistic, this is what happens.

I can write a single big function that does this, but it is non-composable code.

Alternatively, which is what I have been doing, I run each pass separately, but this reloads & uncompresses the file each time.

解决方案

This a comment on the comment of sdcvvc referring to this 'beautiful folding' essay It was so cool -- beautiful, as he says -- I couldn't resist adding Functor and Applicative instances and a few other bits of modernization. Simultaneous folding of, say, x y and z is a straightforward product: (,,) <$> x <*> y <*> z. I made a half-gigabyte file of small random ints and it took 10 seconds to give the -- admittedly trivial -- calculation of length, sum and maximum on my rusty laptop. It doesn't seem to be helped by further annotations, but the compiler could see Int was all I was interested in; the obvious map read . lines as a parser led to a hopeless space and time catastrophe so I unfolded with a crude use of ByteString.readInt; otherwise it is basically a Data.List process.

{-# LANGUAGE GADTs, BangPatterns #-}

import Data.List (foldl', unfoldr)
import Control.Applicative 
import qualified Data.ByteString.Lazy.Char8 as B

main = fmap readInts (B.readFile "int.txt") >>= print . fold allThree
  where allThree = (,,) <$> length_ <*> sum_ <*> maximum_

data Fold b c where  F ::  (a -> b -> a) -> a -> (a -> c) -> Fold b c
data Pair a b = P !a !b

instance Functor (Fold b) where  fmap f (F op x g) = F op x (f . g)

instance Applicative (Fold b) where
  pure c = F const () (const c)
  (F f x c) <*> (F g y c') = F (comb f g) (P x y) (c *** c')
    where comb f g (P a a') b = P (f a b) (g a' b)
          (***) f g (P x y) = f x ( g y)

fold :: Fold b c -> [b] -> c
fold (F f x c) bs = c $ (foldl' f x bs)

sum_, product_ :: Num a => Fold a a
length_ :: Fold a Int
sum_     = F (+) 0 id
product_ = F (*) 1 id
length_  = F (const . (+1)) 0 id
maximum_ = F max 0 id
readInts  = unfoldr $ \bs -> case B.readInt bs of
  Nothing      -> Nothing
  Just (n,bs2) -> if not (B.null bs2) then Just (n,B.tail bs2) 
                                      else Just (n,B.empty)

Edit: unsurprisingly, since we have to do with an unboxed type above, and an unboxed vector derived from e.g. a 2G file can fit in memory, this is all twice as fast and somewhat better behaved if it is given the obvious relettering for Data.Vector.Uboxed http://hpaste.org/69270 Of course this isn't relevant where one has types like LogEntry Note though that the Fold type and Fold 'multiplication' generalizes over sequential types without revision, thus e.g. the Folds associated with operations on Chars or Word8s can be simultaneously folded directly over a ByteString. One must first define a foldB, by relettering fold to use the foldl's in the various ByteString modules. But the Folds and products of Folds are the same ones you would fold a list or vector of Chars or Word8s

这篇关于Haskell:我可以在相同的懒惰列表上执行多次折叠而不在内存中保留列表吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆