XML: Process large data


Problem description

What XML parser do you recommend for the following purpose:

The XML file (formatted, containing whitespace) is around 800 MB. It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.

Removing attributes I don't need could save around 30%, maybe a bit more.

First part, for optimizing the second part: is there any good tool (command line; Linux, and Windows if possible) to easily remove unused attributes in certain tags? I know that XSLT could be used. Or are there any easy alternatives? Also, I could split it into three files, one for each tag, to gain speed for later parsing... Speed is not too important for this preparation of the data; of course it would be nice if it took minutes rather than hours.
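
One easy alternative, sketched here in Java (which is planned for the second part anyway), is a StAX copy that drops every attribute not on a whitelist. It streams the document event by event, so the 800 MB file is never held in memory. The class name, the file arguments and the whitelist contents (id plus a hypothetical ref reference attribute) are assumptions for the sketch, not taken from the thread:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

/** Streams input XML to output XML, keeping only whitelisted attributes. */
public class AttributeStripper {

    // Hypothetical whitelist: only "id" and "ref" survive.
    private static final Set<String> KEEP =
            new HashSet<String>(Arrays.asList("id", "ref"));

    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream(args[0]));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream(args[1]));
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                // Collect only the attributes we want to keep.
                List<Attribute> kept = new ArrayList<Attribute>();
                for (Iterator<?> it = start.getAttributes(); it.hasNext();) {
                    Attribute att = (Attribute) it.next();
                    if (KEEP.contains(att.getName().getLocalPart())) {
                        kept.add(att);
                    }
                }
                // Rewrite the start tag with the reduced attribute set.
                event = events.createStartElement(
                        start.getName(), kept.iterator(), start.getNamespaces());
            }
            writer.add(event);
        }
        writer.close();
        reader.close();
    }
}
```

Run as e.g. `java AttributeStripper big.xml stripped.xml`. Splitting into per-tag files could be done in the same pass by routing events to three writers.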

Second part: once I have the data prepared, be it shortened or not, I should be able to search for the id attribute I mentioned, this being time-critical.

Estimations using wc -l tell me that there are around 3M n-tags and around 418K w-tags. The latter can contain up to approximately 20 subtags each. The w-tags also contain some, but those would be stripped away.

"All I have to do" is navigate between tags containing certain id attributes. Some tags hold references to other ids, therefore giving me a tree, maybe even a graph. The original data is big (as mentioned), but the result set shouldn't be too big, as I only have to pick out certain elements.

Now the question: what XML parsing library should I use for this kind of processing? I would use Java 6 in a first instance, keeping in mind that it will later be ported to BlackBerry.

Might it be useful to just create a flat file indexing the ids and pointing to offsets in the file? Is it even necessary to do the optimizations mentioned above? Or are there parsers known to be just as fast on the original data?
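
A minimal sketch of what such a flat-file index could look like, assuming the formatted file keeps one tag per line (as the wc -l estimate suggests) and that ids appear literally as id="...": one pass records each id together with the byte offset of its line, so a later lookup can seek straight to it. Note that RandomAccessFile.readLine is unbuffered and slow; this illustrates the layout, not a tuned scanner:

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One pass over the XML, writing "id<TAB>byteOffset" lines to a flat index. */
public class OffsetIndexer {

    // \b avoids matching inside other attribute names such as "uid".
    private static final Pattern ID = Pattern.compile("\\bid=\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        RandomAccessFile xml = new RandomAccessFile(args[0], "r");
        PrintWriter index = new PrintWriter(new FileWriter(args[1]));

        // "offset" always holds the byte position of the line just read.
        long offset = xml.getFilePointer();
        for (String line; (line = xml.readLine()) != null;
                offset = xml.getFilePointer()) {
            Matcher m = ID.matcher(line);
            while (m.find()) {
                index.println(m.group(1) + "\t" + offset);
            }
        }
        index.close();
        xml.close();
    }
}
```

A lookup would then scan (or binary-search, if the index is sorted) the small index file and seek to the recorded offset in the big one.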

Little note: as a test, I took an id from the very last line of the file and searched for it using grep. This took around a minute on a Core 2 Duo.

What happens if the file grows even bigger, let's say 5 GB?

I appreciate any notes or recommendations. Thank you all very much in advance, and regards.

Recommended answer

As Bouman has pointed out, treating this as pure text processing will give you the best possible speed.

To process this as XML, the only practical way is to use a SAX parser. The SAX parser built into the Java API is perfectly capable of handling this, so there is no need to install any third-party libraries.
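
A minimal sketch of that approach, using only the JAXP SAX parser bundled with Java 6: the handler watches every start tag for a matching id attribute and aborts the parse as soon as it finds one, since SAX would otherwise read on to the end of the 800 MB file. Class names and command-line arguments are illustrative:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Streams the document and stops as soon as the wanted id is seen. */
public class IdFinder extends DefaultHandler {

    /** Thrown to abort parsing early once the element is found. */
    static class FoundException extends SAXException {
        FoundException(String msg) { super(msg); }
    }

    private final String wantedId;

    IdFinder(String wantedId) { this.wantedId = wantedId; }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (wantedId.equals(atts.getValue("id"))) {
            // SAX has no "stop" call, so an exception is the usual escape hatch.
            throw new FoundException("found <" + qName + "> with id " + wantedId);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new java.io.File(args[0]), new IdFinder(args[1]));
            System.out.println("id not found");
        } catch (FoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Run as e.g. `java IdFinder data.xml w12345`. Because SAX keeps no document tree in memory, the same handler works unchanged if the file grows to 5 GB; only the scan time grows.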
