How to use the xml-conduit Cursor interface for information extraction from a large XML file (around 30G)

Question

This question builds on the accepted answer to an earlier Stack Overflow question. The author of that answer said that the streaming helper API in xml-conduit had not been updated for years, and recommended the Cursor interface instead.

Based on that answer, I wrote the following Haskell code, which uses the Cursor interface of the xml-conduit package.

{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets "page" and "title" be used as XML element names.

import Data.Text (Text)
import qualified Text.XML as XML
import Text.XML.Cursor (Cursor, ($/), (&/), ($//), (>=>),
    fromDocument, element, content)

data Page = Page
    { title :: Text
    } deriving (Show)

-- Recent xml-conduit releases take a plain Prelude FilePath,
-- so the system-filepath imports are not needed.
parse :: FilePath -> IO ()
parse path = do
    doc <- XML.readFile XML.def path
    let cursor = fromDocument doc
    -- Visit every <page> descendant and build a Page from it.
    let pages = cursor $// element "page" >=> parseTitle
    -- Start from an empty output file, then append one line per page.
    writeFile "output.txt" ""
    mapM_ (appendFile "output.txt" . (++ "\n") . show) pages

-- Concatenate the text content of the <title> children of a <page>.
parseTitle :: Cursor -> [Page]
parseTitle c = [Page (mconcat titleText)]
  where
    titleText = c $/ element "title" &/ content

main :: IO ()
main = parse "input.xml"

This code works on small XML files. However, when it is run on a 30G XML file, the process is killed by the OS.

How can I make this code work on a very large XML file?

Answer

The Cursor module requires the entire document to be in memory, which does not seem possible in this case. If you want to process files that large, you'll need to use the streaming interface.
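
As a rough illustration of what the streaming route could look like, here is a minimal sketch that walks the raw event stream from Text.XML.Stream.Parse and keeps only the text of the current title in memory. It is not the code from the linked answer. The names `titles` and `extractTitles` are made up for this example, elements are matched by local name only (namespaces are ignored), and it assumes `<title>` elements are not nested and appear only where you want them. It also assumes the conduit, xml-conduit, xml-types and text packages are available.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Conduit                (ConduitT, await, mapM_C,
                                         runConduitRes, yield, (.|))
import           Control.Monad.IO.Class (liftIO)
import           Data.Text              (Text)
import qualified Data.Text              as T
import qualified Data.Text.IO           as TIO
import           Data.XML.Types         (Content (..), Event (..), Name (..))
import           System.IO              (IOMode (WriteMode), withFile)
import           Text.XML.Stream.Parse  (def, parseFile)

-- | Hypothetical helper: emit the text of every <title> element,
-- one Text value per element. Only the chunks of the title currently
-- being read are held in memory.
titles :: Monad m => ConduitT Event Text m ()
titles = outside
  where
    -- Skip events until a <title> start tag shows up.
    outside = await >>= \mev -> case mev of
      Nothing -> pure ()
      Just (EventBeginElement name _)
        | nameLocalName name == "title" -> inside []
      Just _ -> outside

    -- Collect text chunks until the matching </title>.
    inside acc = await >>= \mev -> case mev of
      Nothing -> pure ()
      Just (EventEndElement name)
        | nameLocalName name == "title" ->
            yield (T.concat (reverse acc)) >> outside
      Just (EventContent (ContentText t)) -> inside (t : acc)
      Just (EventCDATA t)                 -> inside (t : acc)
      Just _                              -> inside acc

-- | Hypothetical driver: stream titles from the input file to the
-- output file, one per line.
extractTitles :: FilePath -> FilePath -> IO ()
extractTitles input output =
  withFile output WriteMode $ \h ->
    runConduitRes $
      parseFile def input
        .| titles
        .| mapM_C (liftIO . TIO.hPutStrLn h)

main :: IO ()
main = extractTitles "input.xml" "output.txt"
```

Unlike the Cursor version, this writes the bare title text rather than a `show`-ed `Page` value. Because each event is consumed and discarded as it arrives, memory use should depend on the size of a single title rather than on the size of the file.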
