What are fast XML parsers for Ruby?

Question

I am using Nokogiri, which works well for small documents. But for a 180KB HTML file I have to increase the process stack size via ulimit -s, and both parsing and XPath queries take a long time.

Are there faster methods available using a stock Ruby distribution?

I am getting used to XPath, but the solution does not necessarily need to support XPath.

The criteria are:

  1. Quick to write.
  2. Fast execution.
  3. Robust parser.

Recommended answer

Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. libxml2 is written in C, and there are bindings for it in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM, so XPath queries require the whole document to be parsed into a tree first.

SAX is often what people turn to for speed or for large documents that don't fit into memory. It is event-driven: it notifies you of a start element, an end element, and so on, and you write handlers to react to them. It's a bit of a pain because you end up keeping track of state yourself (e.g. which elements you're "inside").

There is a middle ground: some parsers have a "pull parsing" capability that gives you cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It has the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML), and this alone makes it a very good choice for HTML parsing.
