DOM vs SAX XML parsing for large files


Question

Background:

I have a large OWL (Web Ontology Language) file (approximately 125MB, or 1.5 million lines long) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers and found the following:

  • SAX allows for the document to be read node by node, so the whole document is not in memory.
  • DOM allows for the whole document to be placed in memory at once, but has a ridiculous amount of overhead.

SAX vs DOM for large files:

As far as I understand it,

  • If I use SAX, I would have to iterate through 1.5 million lines of code, node by node.
  • If I use DOM, I would have a big overhead, but then the results would be returned rapidly.

Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know of any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.

Solution

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. But it is a cross between DOM's bidirectional read/write support and ease of use, and SAX's CPU and memory efficiency.

SAX is read-only and does push parsing, which forces you to handle events and errors right there and then while parsing the input. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.
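To make the pull model concrete, here is a minimal sketch using the StAX cursor API (`javax.xml.stream.XMLStreamReader`) to turn a stream of elements into tab-delimited rows. The element and attribute names (`Class`, `id`, `label`) are hypothetical placeholders, not taken from any real OWL file. A real ontology would most likely need namespace-aware lookups such as `rdf:about`, so treat this as an illustration of the control flow rather than a ready-made OWL converter.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxToTsv {
    // Hypothetical names, chosen only for illustration.
    static final String RECORD = "Class";
    static final String ID_ATTR = "id";
    static final String LABEL = "label";

    // Pulls events one at a time; only the current event is held in memory.
    static List<String> toTsv(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> rows = new ArrayList<>();
        String currentId = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // the application asks for the next event
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (RECORD.equals(name)) {
                        // A null namespace URI means the namespace is not checked.
                        currentId = reader.getAttributeValue(null, ID_ATTR);
                    } else if (LABEL.equals(name) && currentId != null) {
                        // getElementText() reads a text-only element up to its end tag.
                        rows.add(currentId + "\t" + reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && RECORD.equals(reader.getLocalName())) {
                    currentId = null;
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }
}
```

Calling `StaxToTsv.toTsv(new FileReader(path))` would return one row per label. For a 125MB input it would probably be better to write each row to an output stream as it is produced instead of collecting them in a list.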

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.
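The "portions can be skipped" point can be sketched with the cursor API as follows. The helper `skipElement` and the class `StaxSkip` are hypothetical names introduced here, not part of StAX itself. Note the assumption behind any speed-up: the parser still has to tokenize the skipped bytes, so the saving comes from the application doing no work on them, or from stopping the loop early once it has what it needs.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSkip {
    // Assumes the reader is positioned on a START_ELEMENT; advances to the
    // matching END_ELEMENT, discarding every descendant event on the way.
    static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the local names of all elements visited, ignoring any subtree
    // whose root element is named `skipped`.
    static List<String> visitedNames(Reader in, String skipped) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    if (skipped.equals(reader.getLocalName())) {
                        skipElement(reader); // the application decides to move on
                    } else {
                        names.add(reader.getLocalName());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

With SAX, the handler would be called for every event inside the skipped subtree and would have to track state to ignore them. Here the decision sits in the calling loop.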
