Processing (too) many XML files (with TagSoup)
Problem Description
I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).
To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a tagsoup-based parser, and then outputting/formatting the resultant list.
This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.
What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
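On that last point: strict ByteString I/O also happens to sidestep the handle exhaustion, because each file is read completely and its handle closed before the read returns, so mapping over thousands of paths never accumulates open handles. A minimal sketch of that idea (the helper name is illustrative, and the Char8 unpack is a crude decoding choice; real code would decode via Data.Text):

```haskell
import qualified Data.ByteString.Char8 as BS

-- Unlike the lazy Prelude readFile, BS.readFile opens the file, reads
-- it all into memory, and closes the handle before returning. Feeding
-- the unpacked String to a parser afterwards therefore never keeps a
-- handle alive.
readFileStrict :: FilePath -> IO String
readFileStrict path = fmap BS.unpack (BS.readFile path)
```

With this, mapM (fmap readMetaData . readFileStrict) paths holds at most one handle open at a time.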
Here's some code. I apologize for the naivety:
import System.FilePath
import Text.HTML.TagSoup
data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
  where
    title =
      innerText $
        (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
          tags
    base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
    tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path
-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined
main :: IO ()
main = do
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."
  -- Do stuff with metas, which will cause the files to actually be read.
The quick and dirty solution:
-- Requires: import System.IO and import Control.Exception
parseMetaDataFile path = withFile path ReadMode $ \h -> do
  res@(MetaData x y) <- fmap readMetaData $ hGetContents h
  -- Force both fields to the end before withFile closes the handle.
  Control.Exception.evaluate (length (x ++ y))
  return res
A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
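For reference, such an instance is a couple of lines with the deepseq package (which ships with GHC). MetaData is redeclared here only to keep the sketch self-contained; in the real code the instance would sit next to the existing definition:

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)

-- Same type as in the question, repeated for self-containment.
data MetaData = MetaData String String deriving (Show, Eq)

-- Reducing a MetaData to normal form forces both String fields
-- completely, so `evaluate (force meta)` replaces the
-- `length (x ++ y)` trick.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base
```

parseMetaDataFile then becomes, roughly, withFile path ReadMode (\h -> evaluate . force . readMetaData =<< hGetContents h).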