Haskell Read Last Line with a Lazy mmap

Question

I want to read the last line of my file and make sure it has the same number of fields as my first---I don't care about anything in the middle. I'm using mmap because it's fast for random access on large files, but am encountering problems not understanding Haskell or laziness.

λ> import qualified Data.ByteString.Lazy.Char8 as LB
λ> import System.IO.MMap
λ> outh <- mmapFileByteStringLazy fname Nothing 
λ> LB.length outh
87094896
λ> LB.takeWhile (`notElem` "\n") outh
"\"Field1\",\"Field2\",

Great.

From here, I know that

takeWhileR p xs is equivalent to reverse (takeWhileL p (reverse xs)).

So let's make this. That is, let's get the last line by reversing my lazy bytestring, taking while not "\n" just as before, then unreversing it. Laziness makes me think the compiler will let me do this easily.

So let's try it:

LB.reverse (LB.takeWhile (`notElem` "\n") (LB.reverse outh))

What I expected to see:

"\"val1\",\"val2\",

Instead, this crashes my session.

Segmentation fault (core dumped)

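Questions:

  1. What is going wrong here: my use of laziness, the bytestrings, the mmap library, or Haskell itself?
  2. How do I get this line correctly and memory-efficiently? (Might the answer use foreign pointers rather than lazy bytestrings?)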

For other readers, if you're looking to get the last line, you may find a very fast and suitable method described in the answer here: hSeek and SeekFromEnd in Haskell

In this thread, I'm looking specifically for a solution using mmap.

Recommended answer

I would prefer the use of bytestring-mmap made by the same author as bytestring. In either case, all you need is

import System.IO.Posix.MMap (unsafeMMapFile)
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  -- can be swapped out for `mmapFileByteString` from `mmap`
  bs <- unsafeMMapFile "file.txt"

  let (firstLine, _) = BS.break (== '\n') bs
      (_, lastLine) = BS.breakEnd (== '\n') bs

  putStrLn $ "First line: " ++ BS.unpack firstLine
  putStrLn $ "Last line: " ++ BS.unpack lastLine

This runs instantly too, with no extra allocations. As before, there is the caveat that many files end in newlines, so one may want to use BS.breakEnd (== '\n') (BS.init bs) to ignore the last \n character.
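To make the trailing-newline caveat concrete, here is a minimal sketch on strict bytestrings that drops one final '\n' (if present) before taking the last line, then compares field counts as the question asks. The names firstLineOf, lastLineOf, and fieldCount are hypothetical helpers, not part of any library, and fieldCount is deliberately naive: it splits on commas and ignores CSV quoting. It also does not handle "\r\n" line endings.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | First line of the input: everything before the first '\n'.
firstLineOf :: BS.ByteString -> BS.ByteString
firstLineOf = BS.takeWhile (/= '\n')

-- | Last line of the input, ignoring a single trailing '\n' if present.
-- (Hypothetical helper; uses BS.unsnoc so an empty input is safe.)
lastLineOf :: BS.ByteString -> BS.ByteString
lastLineOf bs = snd (BS.breakEnd (== '\n') body)
  where
    body = case BS.unsnoc bs of
      Just (rest, '\n') -> rest
      _                 -> bs

-- | Naive field count: splits on commas, ignoring CSV quoting rules.
fieldCount :: BS.ByteString -> Int
fieldCount = length . BS.split ','

main :: IO ()
main = do
  -- In-memory sample standing in for the mmapped file contents.
  let sample = BS.pack "\"Field1\",\"Field2\"\n\"a\",\"b\"\n\"val1\",\"val2\"\n"
  BS.putStrLn (firstLineOf sample)
  BS.putStrLn (lastLineOf sample)
  print (fieldCount (firstLineOf sample) == fieldCount (lastLineOf sample))
```

With a real file, sample would be replaced by the bytestring returned from the mmap call; the helpers themselves only take slices of it and do not copy the data.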

Also, I would not recommend reversing the bytestring: that will require at least some allocations, which in this case are completely avoidable. Even if you use a lazy bytestring, you still pay the cost of going through all the chunks of the bytestring (which hopefully shouldn't even have been constructed at this point). That said, your reversing code should work. I reckon something is off with mmap (probably the package, as doing the same thing with a strict bytestring works just fine).
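As a small illustration of that point, the sketch below runs the question's reverse-based approach on a strict bytestring next to the breakEnd approach and checks that they agree. The names viaReverse and viaBreakEnd are hypothetical, and the sample input is in-memory rather than mmapped. The difference is cost: BS.reverse builds a reversed copy of the whole input, while BS.breakEnd only scans backwards from the end and returns slices.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | The question's approach, on a strict bytestring: reverse, take up to
-- the first '\n', reverse back. Allocates a full reversed copy of the input.
viaReverse :: BS.ByteString -> BS.ByteString
viaReverse = BS.reverse . BS.takeWhile (/= '\n') . BS.reverse

-- | The answer's approach: scan from the end and return a slice.
viaBreakEnd :: BS.ByteString -> BS.ByteString
viaBreakEnd = snd . BS.breakEnd (== '\n')

main :: IO ()
main = do
  let sample = BS.pack "h1,h2\na,b\nx,y"
  print (viaReverse sample, viaBreakEnd sample)
  print (viaReverse sample == viaBreakEnd sample)
```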

I'm not sure what the problem is with the functions in System.IO. The following runs instantly on my laptop, file file.txt being almost 4GB. It isn't elegant, but it is certainly efficient.

import System.IO

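-- Reads the last line by seeking backwards from the end of the file, one
-- character at a time, until a newline is found. Note that a file ending in
-- '\n' yields "", and a file containing no '\n' at all would seek before
-- the start of the file.
hGetLastLine :: Handle -> IO String
hGetLastLine hdl = go "" (negate 1)
  where
  go s i = do
    hSeek hdl SeekFromEnd i
    c <- hGetChar hdl
    if c == '\n'
      then pure s
      else go (c:s) (i-1)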


main :: IO ()
main = do
  handle <- openFile "file.txt" ReadMode

  firstLine <- hGetLine handle
  putStrLn $ "First line: " ++ firstLine

  lastLine <- hGetLastLine handle
  putStrLn $ "Last line: " ++ lastLine
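The handle-based version can be sketched with two extra guards: skipping one trailing newline and stopping at the start of the file. This is a variant under stated assumptions, not the answerer's code: hGetLastLineSafe is a hypothetical name, the file is opened in binary mode so that each byte is read as one Char (so it assumes single-byte text such as ASCII), and last-line-demo.txt is a scratch file created only for the demonstration.

```haskell
import System.IO

-- | Reads the last line by stepping backwards one byte at a time.
-- Skips a single trailing '\n' and stops at the start of the file.
-- Assumes the handle is in binary mode (one byte per Char).
hGetLastLineSafe :: Handle -> IO String
hGetLastLineSafe hdl = do
  size <- hFileSize hdl
  go "" (size - 1) True
  where
    go acc pos atEnd
      | pos < 0 = pure acc                -- reached the start of the file
      | otherwise = do
          hSeek hdl AbsoluteSeek pos
          c <- hGetChar hdl
          case c of
            '\n' | atEnd     -> go acc (pos - 1) False  -- skip one trailing newline
                 | otherwise -> pure acc                -- found the line boundary
            _ -> go (c : acc) (pos - 1) False

main :: IO ()
main = do
  let path = "last-line-demo.txt"  -- hypothetical scratch file
  writeFile path "h1,h2\na,b\nx,y\n"
  withBinaryFile path ReadMode $ \hdl -> do
    firstLine <- hGetLine hdl
    lastLine <- hGetLastLineSafe hdl
    putStrLn ("First line: " ++ firstLine)
    putStrLn ("Last line: " ++ lastLine)
```

Like the original, this issues one seek and one read per character of the last line, so it is only reasonable when the last line is short relative to the file.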
