What tool to use for processing large documents


Problem description

Hi Folks,

I'm new here, and I need some advice on what tool to use.

I'm using XML for benchmarking purposes. I'm writing some scientific
programs which I want to analyze. My program generates large XML logs
giving semi-structured information on the flow of the program. The XML
tree looks like the method-call tree, but at a much higher level, and I
add the values of some variables.

There is no predefined schema, and often, as I modify my program, I
will add some new tags and new information to the log.

Once a log is written, I never modify the document.

To analyze the data, I have an /almost/ perfect solution: from Matlab,
I call the methods of the Java library dom4j. Typically, I load a
document, then dump the values of attributes matching an XPath
expression into a Matlab array, then do some stats or plotting. I'm
very happy with the comfort and ease of this solution: there is no DB
to set up, you just load a document, and Matlab gives you an
environment in which you can call Java methods without creating a Java
program, so it's very easy to debug the XPath expressions you pass to
dom4j's "selectNodes" method.

Now, the problem is, it's perfect for documents of a few tens of
megabytes, but now I would like to process documents of several hundred
MBs to, let's say, maybe 10 GB (that's a fairly large upper bound).

It seems I have to give up on dom4j for that. I have tried to use
eXist to create a DB with my documents, and all I got was a lot of
(rather violent) crashes when I tried to run the first example they give
in the doc for retrieving a document via the XML:DB API. Then I tried
BerkeleyDB XML, which I have not been able to install. I then tried
xmlDB, but as I tried to import a first document into a collection I got
a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
the doc of how to specify the heap space.
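For what it's worth, that OutOfMemoryError is usually tunable even when
a tool's own documentation is silent about it: the maximum heap of any
JVM is set with the standard -Xmx flag, and Matlab reads extra JVM
options from a java.opts file in its startup directory. A sketch (the
jar name is a placeholder and the 2g value is arbitrary):

    java -Xmx2g -jar the-xml-db-tool.jar   # standalone JVM
    # from Matlab: put the single line "-Xmx2g" into a java.opts
    # file in the directory where Matlab starts

Whether the tool then copes with a multi-GB import is, of course, a
separate question.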

After these 3 unsuccessful trials, I'd like to ask for some advice!

To summarize, my needs are:
* Processing (very) large XML documents
* Need for XPath
* Java API, to be able to call from Matlab
* Read-only processing
* Single user, no security issues, no remote access needed
* Platform: Java if possible, otherwise Linux/Debian on x86.

I welcome any suggestion.

- Luc Mercier.

Answers


Luc Mercier wrote:

* Processing (very) large XML documents
* Need for XPath

That combination sounds like you want a serious XML database. If done
right, that should give you a system which already knows how to
handle documents larger than memory and one which implements XPath data
retrieval against them, leaving you to implement just the program logic.
(I haven't worked with any of these, but I'll toss out my standard
reminder that IBM's DB2 now has XML-specific capabilities. I'm not sure
whether those have been picked up in Cloudscape, IBM's Java-based database.)

Another solution is not to work on the whole document at once. Instead,
go with streaming-style processing, SAX-based, with a relatively
small amount of persistent data. You can hand-code the extraction, or
there have been papers describing systems which can be used to filter a
SAX stream and extract just the subtrees which match a specified XPath.
Of course you may have to reprocess the entire stream in order to
evaluate a different XPath, but it is a way around memory constraints.
It works very well for some specific systems, either alone or by feeding
this "filtered" SAX stream into a model builder to construct a model
that reflects only the data your application actually cares about. On
the other hand, if you need true random access to the complete document,
this won't do it for you.
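As a rough illustration of the hand-coded variant (not anyone's actual
code from this thread; the element name "call", the attribute "time",
and the file name "run.xml" are invented), here is a SAX handler that
streams the file and keeps only the matching attribute values, never
building the tree:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class ExtractTimes {
        public static void main(String[] args) throws Exception {
            List<String> times = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // Rough equivalent of //call/@time, evaluated on
                    // the fly; memory use is bounded by the results,
                    // not by the document size.
                    if ("call".equals(qName)) {
                        String t = atts.getValue("time");
                        if (t != null) times.add(t);
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File("run.xml"), handler);
            System.out.println(times.size() + " values extracted");
        }
    }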

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry


Luc Mercier wrote:

Now, the problem is, it's perfect for documents of a few tens of
megabytes, but now I would like to process documents of several hundred
MBs to, let's say, maybe 10 GB (that's a fairly large upper bound).

Whatever XML parser you use, notice that the parser
cannot parse faster than the disk can read the XML data.
Reading 10 GB off a disk will take around 3 to 5 minutes
for disk access alone. Reading and parsing together should
take around 20 minutes, even with the best parsers.
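(As a rough sanity check on these figures: sequential read rates of
roughly 35-55 MB/s are typical for a single desktop disk, and
10 GB / 50 MB/s ≈ 200 s, i.e. a bit over 3 minutes, which is consistent
with the 3-to-5-minute estimate above.)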


It seems I have to give up on dom4j for that. I have tried to use
eXist to create a DB with my documents, and all I got was a lot of
(rather violent) crashes when I tried to run the first example they give
in the doc for retrieving a document via the XML:DB API. Then I tried
BerkeleyDB XML, which I have not been able to install. I then tried
xmlDB, but as I tried to import a first document into a collection I got
a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
the doc of how to specify the heap space.

Remember that a DOM is a complete copy of the XML data in the
address space of the CPU. If your XML data is 10 GB, then your address
space has to be at least 10 GB. This is unrealistic on today's machines.
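(If anything, 10 GB is a lower bound: DOM implementations typically add
per-node object overhead, so the in-memory tree usually ends up several
times the size of the raw file.)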


* Processing (very) large XML documents
* Need for XPath
* Java API, to be able to call from Matlab
* Read-only processing
* Single user, no security issues, no remote access needed
* Platform: Java if possible, otherwise Linux/Debian on x86.



Java's SAX API can help you parse the data, but SAX will
_not_ allow you to use XPath.


I welcome any suggestion.




OK, I am assuming that the result of XPath processing
is much shorter than the original XML data. If so, I
bet the problem can be solved in xgawk:

http://home.vrweb.de/~juergen.kahrs/...of-an-XML-file

I have used xgawk for parsing files of several GB.
It works, but it will take several minutes, of course.

