Parsing very large XML files and marshalling to Java Objects


Problem description

I have the following issue: I have very large XML files (300+ MB), and I need to parse them in order to add some of their values to the database. The structure of these files is also very complex. I want to use a StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only part of the XML file at a time, without loading the whole thing into memory. On the other hand, getting the values with StAX (at least on these XML files) is cumbersome, and I need to write a ton of code. From that point of view it would help me immensely if I could unmarshal the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of object instances into memory all at once.

My question is: is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects, so I can deal with them easily without running out of memory?

Solution

Well, first off I want to thank the two people who answered my question. In the end I did not use their suggestions, partly because the proposed technologies are a bit far from, let's say, "standard XML parsing" in Java, and it feels odd to go that far when a similar tool is already present in Java, and partly because I did in fact find a solution that only uses Java APIs to accomplish this.

I will not go into much detail about the solution I found, because I've already finished the implementation and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and so on).

I will, however, briefly describe what I ended up doing:

The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks: the chunks can be read with an event-based parser such as StAX and then passed to the JAXB Unmarshaller.

In practice, this means that you must first decide on a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated many times inside a parent tag. So all you have to do is have your StAX parser react at the start of each of those child tags (the equivalent of an event listener), then stream the content of that child tag over to JAXB, unmarshal it with JAXB, and process it.
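As a rough illustration of the chunking step described above, here is a minimal sketch that uses only the StAX API bundled with the JDK. It is not the implementation mentioned in this answer: the element names `item` and `name`, the `Item` class, and the `ChunkedXmlReader` helper are all hypothetical, and `readItem` maps one chunk by hand as a stand-in for the JAXB step. With JAXB on the classpath, `Unmarshaller.unmarshal(XMLStreamReader, Class)` could presumably take over that role.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ChunkedXmlReader {

    /** Hypothetical value class, standing in for a JAXB-generated type. */
    static final class Item {
        final String id;
        final String name;

        Item(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /**
     * Streams over the document and hands each repeated <item> element to the
     * handler, one at a time. Returns the number of items processed.
     * XMLStreamException is wrapped so the call site stays simple.
     */
    static int forEachItem(Reader xml, Consumer<Item> handler) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities for untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        int count = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    // React only at the start of the repeated child element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        handler.accept(readItem(reader));
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return count;
    }

    /**
     * Reads one <item> chunk. The reader must be positioned on the <item>
     * start tag; on return it is positioned on the matching end tag.
     * This hand-written mapping is the part JAXB would replace.
     */
    private static Item readItem(XMLStreamReader reader) throws XMLStreamException {
        String id = reader.getAttributeValue(null, "id");
        String name = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "name".equals(reader.getLocalName())) {
                name = reader.getElementText(); // leaves the reader on </name>
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "item".equals(reader.getLocalName())) {
                break;
            }
        }
        return new Item(id, name);
    }
}
```

Because the handler sees one `Item` at a time and nothing else is retained, memory use depends on the size of a single chunk rather than on the size of the whole file.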

The idea is described very well in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6, which was pretty new at the time, so version-wise it's not that old at all):

http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
