With Haskell, how do I process large volumes of XML?


Question

I've been exploring the Stack Overflow data dumps, so far taking advantage of the friendly XML and "parsing" it with regular expressions. My attempts to use various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing.

TagSoup

import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
  posts <- liftM parseTags (readFile "posts.xml")
  print $ head $ map (fromAttrib "Id") $
                 filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
                 posts



hxt

import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
  runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"



xml

import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
  [posts,votes] <- forM ["posts", "votes"] $
    liftM parseXML . readFile . (++ ".xml")
  let ps = elemNamed "posts" posts
  putStrLn $ maybe "<not present>" show
           $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((==name).qName.elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (==id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing

Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?

Recommended answer

I notice you're doing String IO in all these cases. If you hope to process large volumes of text efficiently, you absolutely must use either Data.Text or Data.ByteString(.Lazy): String is just [Char], a linked list of characters, which is an inappropriate representation for very large flat files.

That in turn means you'll need a Haskell XML library that supports bytestrings. The couple of dozen XML libraries are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which of them support bytestrings, but that's the criterion you're looking for.
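
To illustrate the lazy ByteString I/O suggested above, here is a minimal sketch that needs only the bytestring package. It is not an XML parser and not a recommendation of any particular library: it assumes that posts.xml puts each <row ... /> element on its own line and that attribute values contain no escaped double quotes. The helper names attr and firstPostBy are hypothetical, made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Maybe (listToMaybe)

-- Hypothetical helper: pull the value of one attribute out of a single line.
-- The leading space in the key stops "Id" from matching inside "ParentId".
attr :: B.ByteString -> B.ByteString -> Maybe B.ByteString
attr name line =
  let key       = B.concat [" ", name, "=\""]
      (_, rest) = B.breakSubstring key line
  in if B.null rest
       then Nothing
       else Just (B.takeWhile (/= '"') (B.drop (B.length key) rest))

-- Hypothetical helper: Id of the first row, in document order, owned by uid.
-- Assumes one <row ... /> element per line.
firstPostBy :: B.ByteString -> BL.ByteString -> Maybe B.ByteString
firstPostBy uid doc = listToMaybe
  [ pid
  | line <- map BL.toStrict (BL.lines doc)   -- only one line is strict at a time
  , attr "OwnerUserId" line == Just uid
  , Just pid <- [attr "Id" line]
  ]

main :: IO ()
main = do
  doc <- BL.readFile "posts.xml"   -- read lazily, chunk by chunk
  print (firstPostBy "83805" doc)
```

Because the file is read lazily and listToMaybe stops at the first match, this should only read as far as the first matching row rather than holding the whole document in memory. A real solution would still want a proper streaming XML parser working over ByteString or Text.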
