With Haskell, how do I process large volumes of XML?
Question
I've been exploring the Stack Overflow data dumps, so far taking advantage of the friendly XML and "parsing" with regular expressions. My attempts to use various Haskell XML libraries to find the first post, in document order, by a particular user all ran into nasty thrashing.
TagSoup
import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
  posts <- liftM parseTags (readFile "posts.xml")
  print $ head $ map (fromAttrib "Id") $
    filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
    posts
hxt
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
  runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"
xml
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
  [posts,votes] <- forM ["posts", "votes"] $
    liftM parseXML . readFile . (++ ".xml")
  let ps = elemNamed "posts" posts
  putStrLn $ maybe "<not present>" show
           $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((==name).qName.elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (==id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
Recommended answer
I notice you're doing String IO in all these cases. To process large volumes of text efficiently you really must use either Data.Text or Data.ByteString (or their .Lazy variants), because String is just [Char], a linked list of characters, which is an inappropriate representation for very large flat files.
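To make the point concrete, here is a minimal sketch of the same lookup done with lazy ByteString IO and nothing but the bytestring package. It is not an XML parser: it assumes the dump puts each `<row ... />` element on its own line and that attribute values need no entity decoding, which is an assumption about the file rather than something guaranteed by XML. The helper names `attrValue` and `firstPostBy` are made up for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString.Char8      as B
import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.Maybe                 (listToMaybe, mapMaybe)

-- | Value of attribute @name@ in a single line of the dump, if present.
-- The leading space in the needle stops "Id" from matching inside
-- longer attribute names such as "OwnerUserId".
-- (Hypothetical helper; assumes double-quoted attribute values.)
attrValue :: B.ByteString -> B.ByteString -> Maybe B.ByteString
attrValue name line
  | B.null rest = Nothing
  | otherwise   = Just (B.takeWhile (/= '"') (B.drop (B.length needle) rest))
  where
    needle    = B.concat [" ", name, "=\""]
    (_, rest) = B.breakSubstring needle line

-- | Id of the first row, in document order, whose OwnerUserId equals @uid@.
-- Lines are produced lazily and the search stops at the first match,
-- so the whole file never has to be held in memory at once.
-- (Hypothetical helper; assumes one row element per line.)
firstPostBy :: B.ByteString -> BL.ByteString -> Maybe B.ByteString
firstPostBy uid = listToMaybe . mapMaybe postId . BL.lines
  where
    postId line =
      let strict = BL.toStrict line          -- one line only, so this is small
      in if attrValue "OwnerUserId" strict == Just uid
           then attrValue "Id" strict
           else Nothing

main :: IO ()
main = do
  contents <- BL.readFile "posts.xml"        -- read lazily, chunk by chunk
  print (firstPostBy "83805" contents)
```

The important line is `BL.readFile`: the file arrives as a sequence of packed chunks instead of one list cell per character, and chunks that have already been scanned can be garbage collected. A real XML library working over ByteString or Text input would replace the line-based matching here.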
That in turn means you'll need a Haskell XML library that supports bytestrings. The couple of dozen XML libraries on Hackage are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which of them support bytestrings, but that's the criterion to look for.