Parsing large log files in Haskell

Question

Suppose I have several 200 MB+ files that I want to grep through. How would I do this in Haskell?

Here's my initial program:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs                 -- first command-line argument
  contents <- liftM lines $ readFile filename    -- read the file and split into lines
  putStrLn . unlines . filter (isPrefixOf "import") $ contents  -- keep lines starting with "import"

This reads the whole file into memory before parsing through it. Then I went with this:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs
  file <- openFile filename ReadMode             -- open a handle explicitly
  contents <- liftM lines $ hGetContents file    -- hGetContents reads the handle lazily
  putStrLn . unlines . filter (isPrefixOf "import") $ contents

I thought that since hGetContents is lazy, it would avoid reading the whole file into memory. But running both scripts under valgrind showed similar memory usage for both. So either my script is wrong, or valgrind is wrong. I compile the scripts using

ghc --make test.hs -prof

(Depending on your GHC version, you may also need to add -rtsopts here before the compiled binary will accept +RTS flags at run time.)

What am I missing? Bonus question: I see a lot of mentions on SO of how lazy IO in Haskell is actually a bad thing. How/why would I use strict IO?

Update:

So it looks like I was wrong in my reading of valgrind. Using +RTS -s, here's what I get:

 7,807,461,968 bytes allocated in the heap
 1,563,351,416 bytes copied during GC
       101,888 bytes maximum residency (1150 sample(s))
        45,576 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 13739 collections,     0 parallel,  2.91s,  2.95s elapsed
Generation 1:  1150 collections,     0 parallel,  0.18s,  0.18s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    2.07s  (  2.28s elapsed)
GC    time    3.09s  (  3.13s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.16s  (  5.41s elapsed)

The important line is 101,888 bytes maximum residency, which says that at any given point my script was using at most 101 KB of memory. The file I was grepping through was 44 MB. So I think the verdict is: readFile and hGetContents are both lazy.
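For contrast, here is what a genuinely strict read would look like: a minimal sketch (not from the original post; it assumes the same "import" filter) using the strict Data.ByteString.Char8 API, where readFile pulls the entire file into one buffer before anything else happens:

import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

-- Strict variant: B.readFile loads the whole file into one contiguous
-- buffer before any filtering happens, so for the 44 MB file above the
-- maximum residency would be at least ~44 MB.
main :: IO ()
main = do
  [filename] <- getArgs
  contents <- B.readFile filename
  B.putStr . B.unlines . filter (B.pack "import" `B.isPrefixOf`) . B.lines $ contents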

Follow-up question:

Why do I see 7 GB allocated in the heap? That seems really high for a script that's reading a 44 MB file.

Update to follow-up question

Looks like a few GB allocated in the heap is not atypical for Haskell, so no cause for concern: "bytes allocated in the heap" is the total allocated over the program's entire run, not a high-water mark, and String processing churns through huge numbers of short-lived cons cells that the GC reclaims almost immediately. Using ByteStrings instead of Strings takes the allocation down a lot:

  81,617,024 bytes allocated in the heap
      35,072 bytes copied during GC
      78,832 bytes maximum residency (1 sample(s))
      26,960 bytes maximum slop
           2 MB total memory in use (0 MB lost due to fragmentation)
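The ByteString version isn't shown in the post; a minimal sketch of what it might look like, using the lazy Data.ByteString.Lazy.Char8 API so the file is still streamed rather than loaded whole:

import qualified Data.ByteString.Lazy.Char8 as L
import System.Environment (getArgs)

-- Lazy ByteString variant: the file is read in roughly 32 KB chunks
-- instead of one Char cons cell at a time, which is what cuts the
-- total allocation from ~7.8 GB to ~82 MB.
main :: IO ()
main = do
  [filename] <- getArgs
  contents <- L.readFile filename
  L.putStr . L.unlines . filter (L.pack "import" `L.isPrefixOf`) . L.lines $ contents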

Solution

Both readFile and hGetContents should be lazy. Try running your program with +RTS -s and see how much memory is actually used. What makes you think the entire file is read into memory?

As for the second part of your question, lazy IO is sometimes at the root of unexpected space leaks or resource leaks. It's not really the fault of lazy IO in and of itself, but determining whether it's leaky requires analyzing how it's used.
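To make the resource-leak point concrete, a sketch (not from the original answer) of the classic lazy-IO pitfall: the handle is closed before the lazily-read contents are forced, so the data is silently truncated:

import System.IO

-- Classic lazy-IO pitfall: hGetContents defers the actual reads, so
-- closing the handle first means 'contents' is truncated (typically to
-- the empty string) by the time anything evaluates it.
readBroken :: FilePath -> IO String
readBroken path = do
  h <- openFile path ReadMode
  contents <- hGetContents h
  hClose h          -- nothing has been read from the handle yet!
  return contents   -- forcing this later yields little or no data

As for how you would use strict IO instead: one option, shown earlier, is a strict ByteString readFile; another (a sketch, assuming the same "import" filter) is to read line by line with hGetLine, which is eager, inside withFile so the handle stays open exactly as long as the reads need it:

import System.IO
import System.Environment (getArgs)
import Data.List (isPrefixOf)
import Control.Monad (when, unless)

main :: IO ()
main = do
  [filename] <- getArgs
  withFile filename ReadMode go
  where
    go h = do
      eof <- hIsEOF h
      unless eof $ do
        line <- hGetLine h                      -- strict: the line is read now
        when ("import" `isPrefixOf` line) $ putStrLn line
        go h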
