How can I efficiently parse 200,000 XML files in Java?


Question

I have 200,000 XML files that I want to parse and store in a database.

Here is a sample: https://gist.github.com/902292

This is about as complex as the XML files get. This will also run on a small VPS (Linode), so memory is tight.

What I would like to know is:

1) Should I use a DOM or a SAX parser? DOM seems easier, and fast enough, since each XML file is small.

2) Where can I find a simple tutorial on that parser (DOM or SAX)?

Thanks

Edit

I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM, and I thought that with an average file size of about 3k - 4k it would easily be able to hold each file in memory.

However, I wrote a recursive routine to handle all 200k files, and it gets about 40% of the way through them before Java runs out of memory.

Here is part of the project:
https://gist.github.com/905550#file_xm_lparser.java

Should I ditch DOM now and just use SAX? It just seems that with such small files DOM should be able to handle it.

Also, the speed is "fast enough": parsing 2,000 XML files takes about 19 seconds (before the Mongo insert).

Thanks

Recommended answer

SAX generally beats DOM on speed. But since you say the XML files are small, you can proceed with a DOM parser. One thing you can do to speed things up is to create a thread pool and do the database operations in it. Multithreaded updates can significantly improve performance.
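The thread-pool suggestion can be sketched as follows. This is a minimal illustration, not the asker's code: the class name `BoundedWriter`, the method `writeAll`, and the `sink` callback (a stand-in for the real database insert, such as a Mongo insert) are all hypothetical. Because memory is tight, the sketch assumes a bounded work queue, so that parsed records cannot pile up faster than they are written.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class BoundedWriter {
    /**
     * Passes every record to `sink` on a fixed-size thread pool.
     * `sink` is a hypothetical placeholder for the database insert.
     */
    static <T> void writeAll(Iterable<T> records, Consumer<T> sink, int threads)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                // Bounded queue: at most threads * 4 records wait in memory.
                new ArrayBlockingQueue<>(threads * 4),
                // When the queue is full, the submitting thread runs the task
                // itself, which slows the producer instead of growing the heap.
                new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            for (T record : records) {
                pool.execute(() -> sink.accept(record));
            }
        } finally {
            pool.shutdown();
            // Wait for queued and in-flight writes to finish.
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in sink that only prints; a real one would insert into the database.
        writeAll(Arrays.asList("a.xml", "b.xml", "c.xml"),
                 name -> System.out.println("stored " + name), 2);
    }
}
```

Whether this helps depends on the database client being safe to call from several threads, which is an assumption here and should be checked for the driver in use.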


  • Lalith
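For comparison, here is a minimal sketch of the SAX route the answer mentions. It is not taken from the answer or from the asker's project: the class name `FlatSaxReader` and the choice to collect each leaf element's text into a map keyed by element name are assumptions made for illustration, and the element names in the usage example are made up. The point it shows is that SAX delivers events as it reads, so no document tree is built or retained per file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FlatSaxReader {
    // One parser instance, reused for every file.
    private final SAXParser parser;

    FlatSaxReader() throws Exception {
        parser = SAXParserFactory.newInstance().newSAXParser();
    }

    /** Returns leaf element name -> trimmed text for a single document. */
    Map<String, String> read(InputStream in) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called in several chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (!value.isEmpty()) {
                    fields.put(qName, value); // repeated names overwrite earlier values
                }
                text.setLength(0);
            }
        };
        parser.reset(); // return the parser to its initial state before reuse
        parser.parse(in, handler);
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><title>Hello</title><price> 42 </price></item>";
        FlatSaxReader reader = new FlatSaxReader();
        Map<String, String> fields =
                reader.read(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(fields);
    }
}
```

A flat map like this loses nesting and repeated elements, so a real handler would need to track the element path if the files contain either.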
