Performant parsing of HTML pages with Node.js and XPath


Problem description

I'm into some web scraping with Node.js. I'd like to use XPath, as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.

  1. jsdom is extremely slow. It takes a minute or so to parse a 500 KiB file, with full CPU load and a heavy memory footprint.
  2. Popular libraries for HTML parsing (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
  3. Efficient HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those need to be run in a special way, not just node <script>. I can't accept the risk implied by this change. For example, it's much harder to find out how to run node-inspector with phantom.
  4. Spooky is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath, then?

Recommended answer

You can do so in several steps.

  1. Parse the HTML with parse5. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
  2. Serialize it to XHTML with xmlserializer, which accepts parse5's DOM-like structures as input.
  3. Parse that XHTML again with xmldom. Now you finally have a DOM.
  4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, and queries like //a won't work.
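On the namespace caveat in step 4: a query must either bind the XHTML namespace to a prefix (as the code below does with useNamespaces) or match on local names instead. The second form is a common prefix-free workaround, not part of the original answer:

```
//x:a/@href                  with prefix "x" bound to http://www.w3.org/1999/xhtml
//*[local-name()='a']/@href  prefix-free alternative using the local-name() function
```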

Finally, you get something like this:

const fs = require('mz/fs');             // promise-based fs wrapper
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    // 1. Parse the HTML into parse5's DOM-like tree
    const document = parse5.parse(html.toString());
    // 2. Serialize that tree to XHTML
    const xhtml = xmlser.serializeToString(document);
    // 3. Re-parse the XHTML with xmldom to get a real DOM
    const doc = new dom().parseFromString(xhtml);
    // 4. Bind the XHTML namespace to a prefix and run the XPath query
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();
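Note that for a query like //x:a/@href, select() returns DOM attribute nodes rather than plain strings; each node exposes a value property. A minimal sketch of extracting the strings, with plain objects standing in for the real attribute nodes so it runs in isolation:

```javascript
// Sketch: xpath's select() returns DOM attribute nodes for "//x:a/@href".
// Each attribute node exposes `.value`; mapping over the result yields the
// href strings. Plain objects stand in for the real nodes here.
const nodes = [{ value: '/home' }, { value: '/about' }];
const hrefs = nodes.map(n => n.value);
console.log(hrefs); // → [ '/home', '/about' ]
```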

