Processing (too) many XML files (with TagSoup)


Problem description

I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).

To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a TagSoup-based parser and then outputting/formatting the resultant list.

This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.

What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).

Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.

Here's some code. I apologize for the naivety:

import System.FilePath
import Text.HTML.TagSoup

data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
 where
  title =
    innerText $
    (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
    tags
  base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
  tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path

-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined

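main :: IO ()
main = do
  -- Placeholder root directory; the original post did not specify one.
  let root = "."
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths root

  -- Do stuff with metas, which will cause files to actually be read.
  mapM_ print metas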

Solution

The quick and dirty solution:

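import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = withFile path ReadMode $ \h -> do
    res@(MetaData x y) <- fmap readMetaData $ hGetContents h
    -- Force both fields while the handle is still open; withFile closes it afterwards.
    _ <- evaluate (length (x ++ y))
    return res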

A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
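
As a rough sketch of what that NFData approach might look like, assuming the deepseq package is available: the instance below forces both fields, and evaluate (force ...) replaces the length trick. To keep the example self-contained, readMetaData here is a hypothetical stand-in (it takes the first two lines of the input as title and base) rather than the TagSoup-based parser from the question, and the file name example-meta.txt is made up for the demo.

```haskell
import Control.DeepSeq (NFData (rnf), force)
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

data MetaData = MetaData String String deriving (Show, Eq)

-- Fully evaluate both fields when a MetaData value is forced.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base

-- Hypothetical stand-in for the TagSoup-based readMetaData from the question:
-- it treats the first two lines of the input as title and base.
readMetaData :: String -> MetaData
readMetaData input = case lines input of
  (t : b : _) -> MetaData t b
  [t]         -> MetaData t ""
  []          -> MetaData "" ""

-- Read lazily, but force the result to normal form before withFile
-- closes the handle, so only one file is open at a time.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  evaluate (force (readMetaData contents))

main :: IO ()
main = do
  -- Made-up input file, written here only so the demo can run.
  writeFile "example-meta.txt" "A title\nhttp://example.com/\n"
  meta <- parseMetaDataFile "example-meta.txt"
  print meta
```

With this in place, mapM parseMetaDataFile should open, parse and close each file in turn rather than accumulating open handles, since nothing lazy escapes the withFile block.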
