Processing (too) many XML files (with TagSoup)
Problem Description
I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).
To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a tagsoup-based parser, and then outputting/formatting the resultant list.
This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.
What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee; can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
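On that last point: strict ByteString I/O also happens to sidestep the handle exhaustion, because each file is read completely and its handle closed before the read returns, so mapping over thousands of paths never accumulates open handles. A minimal sketch of that idea (the helper name is illustrative, and the Char8 unpack is a crude decoding choice; real code would decode via Data.Text):

```haskell
import qualified Data.ByteString.Char8 as BS

-- Unlike the lazy Prelude readFile, BS.readFile opens the file, reads
-- it all into memory, and closes the handle before returning. Feeding
-- the unpacked String to a parser afterwards therefore never keeps a
-- handle alive.
readFileStrict :: FilePath -> IO String
readFileStrict path = fmap BS.unpack (BS.readFile path)
```

With this, mapM (fmap readMetaData . readFileStrict) paths holds at most one handle open at a time.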
Here's some code. I apologize for the naivety:
import System.FilePath
import Text.HTML.TagSoup
data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
  where
    title =
      innerText $
        (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
          tags
    base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
    tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path
-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined
main :: IO ()
main = do
  -- Will call openFile for every file, which gives too many open files.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."
  -- Do stuff with metas, which will cause the files to actually be read.
The quick and dirty solution:
-- Requires: import System.IO and import Control.Exception
parseMetaDataFile path = withFile path ReadMode $ \h -> do
  res@(MetaData x y) <- fmap readMetaData $ hGetContents h
  -- Force both fields to the end before withFile closes the handle.
  Control.Exception.evaluate (length (x ++ y))
  return res
A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
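For reference, such an instance is a couple of lines with the deepseq package (which ships with GHC). MetaData is redeclared here only to keep the sketch self-contained; in the real code the instance would sit next to the existing definition:

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)

-- Same type as in the question, repeated for self-containment.
data MetaData = MetaData String String deriving (Show, Eq)

-- Reducing a MetaData to normal form forces both String fields
-- completely, so `evaluate (force meta)` replaces the
-- `length (x ++ y)` trick.
instance NFData MetaData where
  rnf (MetaData title base) = rnf title `seq` rnf base
```

parseMetaDataFile then becomes, roughly, withFile path ReadMode (\h -> evaluate . force . readMetaData =<< hGetContents h).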