Performant parsing of pages with Node.js and XPath
Question
I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.
- jsdom is extremely slow. It's parsing a 500KiB file in a minute or so, with full CPU load and a heavy memory footprint.
- Popular libraries for HTML parsing (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
- Effective HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those require to be run in a special way, not just node <script>. I cannot rely on the risk implied by this change. For example, it's much more difficult to find out how to run node-inspector with phantom.
- Spooky is an option, but it's buggy enough that it didn't run at all on my machine.
What's the right way to parse an HTML page with XPath then?
Answer
You can do it in several steps.
- Parse the HTML with parse5. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
- Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
- Parse that XHTML again with xmldom. Now you finally have that DOM.
- The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so queries like //a won't work.
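Assuming these are the current npm package names (mz supplies the promise-based fs used in the answer's snippet), the whole toolchain can be installed with:

```shell
npm install parse5 xmlserializer xmldom xpath mz
```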
Finally you get something like this.
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    const document = parse5.parse(html.toString());
    const xhtml = xmlser.serializeToString(document);
    const doc = new dom().parseFromString(xhtml);
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();