What are fast XML parsers for Ruby?

Views: 73 | Published: 2020/5/25 0:36:24 | Tags: ruby, xml, parsing

Question

I am using Nokogiri, which works well for small documents. But for a 180KB HTML file I have to increase the process stack size via ulimit -s, and parsing and XPath queries take a long time.

Are there faster methods available using a stock Ruby distribution?

I am getting used to XPath, but the solution does not necessarily need to support XPath.

The criteria are:

Quick to write.
Fast execution.
Robust parser.

Recommended answer

Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. libxml2 is written in C, and there are bindings for it in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM.
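To make the DOM cost concrete, here is a minimal sketch of the parse-then-query pattern. It uses REXML rather than Nokogiri, which is an assumption made only because REXML ships with a stock Ruby install; the answer does not mention it, and as a pure-Ruby parser it is not a speed recommendation. The sample document is made up for illustration.

```ruby
require 'rexml/document'

# Made-up sample data for illustration.
xml = <<~XML
  <library>
    <book id="1"><title>Programming Ruby</title></book>
    <book id="2"><title>The Ruby Way</title></book>
  </library>
XML

# Parsing builds the whole tree in memory before any query can run.
doc = REXML::Document.new(xml)

# The XPath expression is evaluated against that in-memory tree.
titles = REXML::XPath.match(doc, '//book/title').map(&:text)
puts titles.inspect
```

The same two steps (build the tree, then query it) apply to a Nokogiri document; only the parser underneath differs.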

SAX is often what people turn to for speed or for large documents that don't fit into memory. It is event driven: it notifies you of a start element, end element, etc., and you write handlers to react to them. It's a bit of a pain because you end up keeping track of state yourself (e.g. which elements you're "inside").
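The state tracking described above might look like the following sketch. It uses REXML's stream listener, which is an assumption chosen because it comes with Ruby and has the same event-driven shape; Nokogiri exposes the equivalent idea through Nokogiri::XML::SAX::Parser. The TitleCollector class and the sample markup are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects <title> text, but only for titles
# that sit directly inside a <book> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @stack = []   # names of the elements we are currently "inside"
    @titles = []
  end

  def tag_start(name, _attrs)
    @stack.push(name)
    @titles << String.new if inside_book_title?
  end

  def text(data)
    @titles.last << data if inside_book_title?
  end

  def tag_end(_name)
    @stack.pop
  end

  private

  def inside_book_title?
    @stack.last(2) == %w[book title]
  end
end

# Made-up sample data: one title inside a book, one inside a magazine.
xml = '<library><book><title>Programming Ruby</title></book>' \
      '<magazine><title>Ruby Weekly</title></magazine></library>'

collector = TitleCollector.new
REXML::Document.parse_stream(xml, collector)
puts collector.titles.inspect
```

No tree is ever built here. The @stack array is the bookkeeping the answer calls a pain: without it, the handler could not tell a book title from a magazine title.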

There is a middle ground: some parsers have a "pull parsing" capability that gives you cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It has the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.
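As a rough illustration of the pull style, the sketch below uses REXML's pull parser, again an assumption picked because it is bundled with Ruby rather than something the answer recommends; Nokogiri's counterpart is the Reader API mentioned above. The skip_element helper and the sample markup are hypothetical.

```ruby
require 'rexml/parsers/pullparser'

# Made-up sample data for illustration.
xml = '<library><book><title>Programming Ruby</title>' \
      '<reviews><review>Long text</review></reviews></book></library>'

# Hypothetical helper: consume events until the element that was just
# opened is closed again. The parser still reads those nodes, but the
# calling code never has to look at them.
def skip_element(parser)
  depth = 1
  while depth > 0 && parser.has_next?
    event = parser.pull
    depth += 1 if event.start_element?
    depth -= 1 if event.end_element?
  end
end

parser = REXML::Parsers::PullParser.new(xml)
titles = []

# The caller drives the loop and asks for the next event when ready.
while parser.has_next?
  event = parser.pull
  next unless event.start_element?

  case event[0] # element name
  when 'reviews'
    skip_element(parser) # "fast-forward" past an uninteresting subtree
  when 'title'
    text_event = parser.pull
    titles << text_event[0] if text_event.text?
  end
end

puts titles.inspect
```

Because the calling code owns the loop, state lives in ordinary local variables and control flow instead of in handler callbacks, which is what makes this interface easier to work with for many tasks.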

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML), and this alone makes it a very good choice for HTML parsing.
