Parsing very large XML documents (and a bit more) in Java


Question

(All of the following is to be written in Java.)

I have to build an application that takes as input XML documents that are potentially very large. Each document is encrypted (not with XMLsec, but with my client's preexisting encryption algorithm) and will be processed in three phases:

First, the stream will be decrypted according to the aforementioned algorithm.
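To make the first phase concrete, here is a minimal sketch of decryption written as a stream filter on top of `java.io.FilterInputStream`. The client's real algorithm is not described anywhere in the question, so the single-byte XOR below is purely a hypothetical stand-in, and the class name `XorDecryptingInputStream` is made up for illustration.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical stand-in for the client's cipher: XORs every byte with a fixed key.
 * Only the shape (a FilterInputStream that transforms bytes as they are read) is the point.
 */
class XorDecryptingInputStream extends FilterInputStream {
    private final int key;

    XorDecryptingInputStream(InputStream in, int key) {
        super(in);
        this.key = key & 0xFF;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        // -1 signals end of stream and must be passed through untouched.
        return b < 0 ? b : (b ^ key);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // Decrypt only the bytes that were actually read (n may be -1 at end of stream).
        for (int i = 0; i < n; i++) {
            buf[off + i] ^= (byte) key;
        }
        return n;
    }
}
```

Because the filter decrypts chunk by chunk, a parser reading from it never needs the whole plaintext in memory. A real cipher with internal state would also need to handle `skip`, `mark` and `reset` explicitly; the stateless XOR here gets away with inheriting them.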

Second, an extension class (written by a third party against an API I am providing) will read some portion of the file. The amount that is read is not predictable: in particular, it is not guaranteed to be in the header of the file, but might occur at any point in the XML.

Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object.
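The subdividing step could look roughly like the sketch below, which uses the standard `javax.xml.stream` event API to copy each matching element into its own output. The class name `StaxSplitSketch`, the element name passed in, and the choice to split on a single element name are all assumptions for illustration; the question does not say how the third-party class decides where to cut.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxSplitSketch {
    /** Copies every element named localName (with its whole subtree) into a separate string. */
    static List<String> split(InputStream xml, String localName) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(xml);
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        List<String> parts = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && localName.equals(event.asStartElement().getName().getLocalPart())) {
                StringWriter out = new StringWriter();
                XMLEventWriter writer = outFactory.createXMLEventWriter(out);
                int depth = 0;
                while (true) {
                    writer.add(event);
                    if (event.isStartElement()) {
                        depth++;
                    } else if (event.isEndElement() && --depth == 0) {
                        break; // matching end tag of the element we started on
                    }
                    event = reader.nextEvent();
                }
                writer.flush();
                writer.close();
                parts.add(out.toString());
            }
        }
        reader.close();
        return parts;
    }
}
```

Collecting the parts in a list only keeps the demo short; to stay within the memory constraint, each part would be streamed to its own file or downstream consumer instead. Namespace declarations inherited from ancestors are not handled here and would need extra care.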

Here is my question:

Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: by walking over as much of the document as is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
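On the rewinding part: a decrypting stream is forward-only, so one option (an assumption on my part, not something stated in the thread) is to "rewind" by simply opening the source again for each pass. The sketch below uses two hypothetical interfaces, `StreamSource` and `Pass`, to show the idea; neither is part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TwoPassSketch {
    /** Hypothetical: every call returns a fresh stream positioned at the start of the document. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Hypothetical: one forward-only pass over the document; it may stop reading early. */
    interface Pass<T> {
        T run(InputStream in) throws IOException;
    }

    /** Opens a fresh stream, runs one pass over it, and always closes the stream afterwards. */
    static <T> T runPass(StreamSource source, Pass<T> pass) throws IOException {
        try (InputStream in = source.open()) {
            return pass.run(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<jobs><job/><job/></jobs>".getBytes(StandardCharsets.UTF_8);
        // In real use, open() would wrap a newly opened file in the decrypting filter.
        StreamSource source = () -> new ByteArrayInputStream(data);

        // Pass 1 stops after a single byte; pass 2 still starts from the beginning.
        int firstByte = runPass(source, in -> in.read());
        long total = runPass(source, in -> {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        });
        System.out.println((char) firstByte + " " + total); // prints "< 25"
    }
}
```

The cost is decrypting the overlapping prefix twice, but neither pass ever holds more than its current buffer in memory.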

Recommended answer

StAX is the right way. I would recommend looking at Woodstox.
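As a rough illustration of why a pull parser fits here, the sketch below uses the standard `javax.xml.stream` cursor API to scan forward until it finds one element and then stops, without building a tree. The class name `StaxScanSketch` and the element name `owner` are invented for the example. Woodstox implements this same API, so as far as I know placing it on the classpath is enough for `XMLInputFactory.newInstance()` to pick it up, but treat that as an assumption to verify.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxScanSketch {
    /**
     * Returns the text of the first element named localName, or null if there is none.
     * Assumes that element contains only text (getElementText rejects child elements).
     */
    static String firstElementText(InputStream xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    return reader.getElementText(); // stop here; the rest is never parsed
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String doc = "<jobs><meta><owner>alice</owner></meta><job id=\"1\"/><job id=\"2\"/></jobs>";
        InputStream in = new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstElementText(in, "owner")); // prints "alice"
    }
}
```

The parser only pulls as many bytes from the underlying stream as it needs to reach the requested element, which matches the "read an unpredictable portion" requirement of the second phase.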
