Heap memory buildup with xml-conduit parseBytes
Question
I'm parsing some rather large XML files with xml-conduit's streaming interface https://hackage.haskell.org/package/xml-conduit-1.8.0/docs/Text-XML-Stream-Parse.html#v:parseBytes but I'm seeing this memory buildup (here on a small test file):
where the top users are:
The actual data shouldn't take up that much heap – if I serialise and re-read, the resident memory use is kilobytes vs the megabytes here.
The minimal example I've managed to reproduce this with:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad
import Control.Monad.IO.Class
import Data.Conduit
import Data.Conduit.Binary (sourceFile)
import qualified Data.Conduit.List as CL
import Data.Text (Text)
import Text.XML.Stream.Parse

type Y = [(Text, Text)]

main :: IO ()
main = do
  res1 <- runConduitRes $
    sourceFile "test.xml"
      .| Text.XML.Stream.Parse.parseBytes def
      .| parseMain
      .| CL.foldM get []
  print res1

get :: (MonadIO m, Show a) => [a] -> [a] -> m [a]
get acc !vals = do
  liftIO $! print vals -- this oughta force it?
  return $! take 1 vals ++ acc

parseMain = void $ tagIgnoreAttrs "Period" parseDetails

parseDetails = many parseParam >>= yield

parseParam = tag' "param" parseParamAttrs $ \idAttr -> do
  value <- content
  return (idAttr, value)

parseParamAttrs = do
  idAttr <- requireAttr "id"
  attr "name"
  return idAttr
Recommended answer
If I change get to just return ["hi"] or something, I don't get the buildup. So it seems the returned Text values keep a reference to the larger Text they were sliced from (zero-copy slicing, cf. the comment at https://hackage.haskell.org/package/text-0.11.2.0/docs/Data-Text.html#g:18 ), so the rest of that text can't be garbage collected even though we only use small parts of it.
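The sharing can be observed directly. The sketch below is not from the original code: it peeks at the offset field of the Text constructor through Data.Text.Internal (an internal module, so treat that as an assumption about the installed text version), and bufferOffset is a hypothetical helper name. A substring made with T.drop/T.take should report a non-zero offset into the original buffer, while T.copy of it starts at offset 0 in a fresh array.

```haskell
module Main where

import qualified Data.Text as T
import Data.Text.Internal (Text (..))

-- | Offset of a Text into its backing array.
-- Hypothetical helper; it relies on the internal three-field
-- constructor Text array offset length.
bufferOffset :: Text -> Int
bufferOffset (Text _ off _) = off

main :: IO ()
main = do
  let big    = T.replicate 100000 (T.pack "x") -- stands in for a large chunk of input
      slice  = T.take 5 (T.drop 50000 big)     -- small value cut out of the chunk
      copied = T.copy slice                    -- same characters in a fresh array
  -- The slice points into the middle of big's array, so holding on to
  -- the slice keeps the whole array alive.
  putStrLn ("slice offset:  " ++ show (bufferOffset slice))
  -- The copy owns an array of just its own length.
  putStrLn ("copied offset: " ++ show (bufferOffset copied))
  print (slice == copied)
```

Whether xml-conduit hands out slices in exactly this way is the answer's inference from the heap profile, not something this snippet proves; it only shows the mechanism in the text library.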
Our fix is to use Data.Text.copy on any attributes we want to yield:
someattr <- requireAttr "n"
yield (T.copy someattr)
which lets us parse with nearly constant memory use.
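Applied to the parser from the question, the fix might look like the sketch below. It is only an illustration under assumptions: parseParamCopied and parseParams are hypothetical names, the input is a small in-memory document fed through parseLBS rather than test.xml, and the bang patterns are there so that the copies are made immediately (a lazy, unevaluated T.copy thunk would still hold the original slice).

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad (void)
import qualified Data.ByteString.Lazy as L
import Data.Conduit
import qualified Data.Conduit.List as CL
import Data.Text (Text)
import qualified Data.Text as T
import Text.XML.Stream.Parse

-- Same shape as parseParam in the question, but both Text values are
-- copied and forced before being returned. No type signature is given,
-- to avoid importing Event from xml-types; the monad is fixed to IO by
-- the use in parseParams.
parseParamCopied = tag' "param" attrs $ \idAttr -> do
  value <- content
  let !k = T.copy idAttr
      !v = T.copy value
  return (k, v)
  where
    -- Consume "name" as well so tag' does not reject the element.
    attrs = requireAttr "id" <* attr "name"

-- | Parse every param inside a Period element of the given document.
parseParams :: L.ByteString -> IO [[(Text, Text)]]
parseParams doc = runConduit $
  parseLBS def doc
    .| void (tagIgnoreAttrs "Period" (many parseParamCopied >>= yield))
    .| CL.consume

main :: IO ()
main = do
  -- Hypothetical sample input standing in for test.xml.
  let doc = "<Period><param id=\"a\" name=\"n\">1</param><param id=\"b\">2</param></Period>"
  parseParams doc >>= print
```

Copying costs one extra allocation per value, which seems a reasonable trade when the retained values are small compared with the chunks they were cut from.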
(And we might consider using https://markkarpov.com/post/short-bs-and-text.html#shorttext if we want to save even more memory.)