How to use the xml-conduit Cursor Interface for information extraction from a large XML file (around 30G)
Question
The following question is based on the accepted answer to this question. The author of that answer said that the streaming helper API in xml-conduit had not been updated in years (source: the accepted answer of the SO question), and he recommends the Cursor interface instead.
Based on the solution from the first question, I wrote the following Haskell code, which uses the Cursor interface of the xml-conduit package.
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad ((>=>))  -- (>=>) lives in Control.Monad, not Text.XML.Cursor
import Data.Text (Text)
import Text.XML (def)
import qualified Text.XML as XML
import Text.XML.Cursor (Cursor, content, element, fromDocument,
                        ($/), ($//), (&/))

data Page = Page
    { title :: Text
    } deriving (Show)

parse :: FilePath -> IO ()
parse path = do
    doc <- XML.readFile def path
    let cursor = fromDocument doc
    let pages = cursor $// element "page" >=> parseTitle
    writeFile "output.txt" ""
    mapM_ (appendFile "output.txt" . (++ "\n") . show) pages

parseTitle :: Cursor -> [Page]
parseTitle c =
    let titleText = c $/ element "title" &/ content
    in  [Page (mconcat titleText)]

main :: IO ()
main = parse "input.xml"  -- XML.readFile now takes an ordinary FilePath,
                          -- so Filesystem.Path / fromText are not needed
This code works on small XML files. However, when it is run on a 30 GB XML file, the process is killed by the OS.
How can I make this code work on a very large XML file?
Answer
The Cursor module requires that the entire document be in memory, which is not feasible in this case. If you want to process files that large, you'll need to use the streaming interface.
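To give a sense of what the streaming interface looks like, here is a minimal sketch using Text.XML.Stream.Parse from a reasonably recent xml-conduit (≥ 1.8). The element names ("pages", "page", "title") and the assumption that <title> is the first child of <page> are guesses about the input's structure, since the question doesn't show the schema; adjust them to your file.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit (ConduitT, ResourceT, liftIO, mapM_C, runConduitRes, (.|))
import Data.Text (Text)
import Data.XML.Types (Event)
import Text.XML.Stream.Parse

newtype Page = Page { title :: Text } deriving (Show)

-- Parse one <page> element, keeping only the text of its <title> child
-- and discarding the rest of the page without ever building a tree.
-- Assumes <title> is the first child of <page>; if it isn't, the parser
-- needs to skip preceding siblings first.
parsePage :: ConduitT Event o (ResourceT IO) (Maybe Page)
parsePage = tagIgnoreAttrs "page" $ do
    t <- force "<page> must contain a <title>" $ tagNoAttr "title" content
    _ <- many ignoreAnyTreeContent  -- skip the remaining children of <page>
    return (Page t)

main :: IO ()
main = runConduitRes $
       parseFile def "input.xml"
    .| force "<pages> root element required"
         (tagIgnoreAttrs "pages" (manyYield parsePage))
    .| mapM_C (liftIO . print)
```

Because parseFile produces a stream of Events and manyYield emits each Page downstream as soon as it is parsed, memory use stays roughly constant regardless of file size, instead of growing with the whole document as it does with the Cursor interface.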