What is the Most Efficient Java-Based Streaming XSLT Processor?


Question

I have a very large XML file which I need to transform into another XML file, and I would like to do this with XSLT. I am more interested in optimisation for memory, rather than optimisation for speed (though, speed would be good too!).

Which Java-based XSLT processor would you recommend for this task?

Would you recommend any other way of doing it (non-XSLT?, non-Java?), and if so, why?

The XML files in question are very large, but not very deep: millions of rows (elements), but only about 3 levels deep.

Answer

At present there are only three known XSLT 2.0 processors, and of these Saxon 9.x is probably the most efficient (at least in my experience), both in speed and in memory utilisation. Saxon-SA (the schema-aware version of Saxon, which is not free, unlike the B (basic) version) has special extensions for streamed processing.

Of the various existing XSLT 1.0 processors, .NET XslCompiledTransform (C#-based, not Java!) seems to be the champion.

In the Java-based world of XSLT 1.0 processors, Saxon 6.x is again pretty good.

UPDATE:

Now, more than 3 years after this question was originally answered, there isn't any evidence that the efficiency differences between the XSLT processors mentioned have changed.

As for streaming:

  1. An XML document with "millions of nodes" may well be processed even without any streaming. I conducted an experiment in which Saxon 9.1.0.7 processed an XML document containing around one million third-level elements with integer values. The transformation simply calculates their sum. The total time for the transformation on my computer was less than 1.5 seconds. The memory used was about 500 MB -- something that PCs could have even 10 years ago.

Here are Saxon's informational messages that show details about the transformation:

Saxon 9.1.0.7J from Saxonica
Java version 1.6.0_17
Stylesheet compilation time: 190 milliseconds
Processing file:/C:\temp\delete\MRowst.xml
Building tree for file:/C:\temp\delete\MRowst.xml using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 1053 milliseconds
Tree size: 3075004 nodes, 1800000 characters, 0 attributes
Loading net.sf.saxon.event.MessageEmitter
Execution time: 1448 milliseconds
Memory used: 506661648
NamePool contents: 14 entries in 14 chains. 6 prefixes, 6 URIs
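
To make the experiment above concrete, here is a minimal Java sketch of that kind of non-streaming run through the standard JAXP API. The element names (rows, row, v), the class name SumWithJaxp and the inline stylesheet are illustrative assumptions, not the files used in the original test. TransformerFactory.newInstance() uses whichever XSLT processor is available (Saxon if its jar is on the classpath, otherwise the JDK's built-in XSLT 1.0 processor); in either case the whole document is built as a tree in memory before sum() is evaluated.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SumWithJaxp {
    // Hypothetical stylesheet: sums every /rows/row/v value and emits plain text.
    private static final String SUM_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='sum(/rows/row/v)'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the whole input (no streaming) and returns the text result.
    static String sum(Reader xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(SUM_XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(xml), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

Calling SumWithJaxp.sum(new java.io.FileReader("big.xml")) on a real file (the file name is a placeholder) is a quick way to check whether a given document actually needs streaming on the available hardware.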

  2. Saxon 9.4 has a saxon:stream() extension function that can be used for processing huge XML documents.

Here is an excerpt from the documentation:

There are basically two ways of doing streaming in Saxon:

Burst-mode streaming: with this approach, the transformation of a large file is broken up into a sequence of transformations of small pieces of the file. Each piece in turn is read from the input, turned into a small tree in memory, transformed, and written to the output file.

This approach works well for files that are fairly flat in structure, for example a log file holding millions of log records, where the processing of each log record is independent of the ones that went before.

A variant of this technique uses the new XSLT 3.0 xsl:iterate instruction to iterate over the records, in place of xsl:for-each. This allows working data to be maintained as the records are processed: this makes it possible, for example, to output totals or averages at the end of the run, or to make the processing of one record dependent on what came before it in the file. The xsl:iterate instruction also allows early exit from the loop, which makes it possible for a transformation to process data from the beginning of a large file without actually reading the whole file.

Burst-mode streaming is available in both XSLT and XQuery, but there is no equivalent in XQuery to the xsl:iterate construct.

Streaming templates: this approach follows the traditional XSLT processing pattern of performing a recursive descent of the input XML hierarchy by matching template rules to the nodes at each level, but does so one element at a time, without building the tree in memory.

Every template belongs to a mode (perhaps the default, unnamed mode), and streaming is a property of the mode that can be specified using the new xsl:mode declaration. If the mode is declared to be streamable, then every template rule within that mode must obey the rules for streamable processing.

The rules for what is allowed in streamed processing are quite complicated, but the essential principle is that the template rule for a given node can only read the descendants of that node once, in order. There are further rules imposed by limitations in the current Saxon implementation: for example, although grouping using xsl:for-each-group with group-adjacent is theoretically consistent with a streamed implementation, it is not currently implemented in Saxon.
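
The burst-mode idea described in the excerpt can be approximated in plain Java even without Saxon's streaming extensions. The sketch below is not saxon:stream(); it swaps in a StAX cursor to walk the large input and hands one record at a time to an ordinary JAXP transformer, so only a single small tree is held in memory at any moment. The record name, the stylesheet passed in and the class name BurstModeSketch are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class BurstModeSketch {
    // Applies `xslt` to each element named `recordName`, one record at a time,
    // appending every result to `out`. Returns the number of records processed.
    static int transformRecords(Reader input, String recordName, String xslt, Writer out) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer copy = tf.newTransformer(); // identity: StAX subtree -> small DOM
            Transformer perRecord = tf.newTransformer(new StreamSource(new StringReader(xslt)));
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(input);
            int count = 0;
            while (xsr.hasNext()) {
                boolean atRecord = xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                        && recordName.equals(xsr.getLocalName());
                if (atRecord) {
                    // Materialise just this record as a small in-memory tree.
                    DOMResult piece = new DOMResult();
                    copy.transform(new StAXSource(xsr), piece);
                    // Transform the piece and append the result to the output.
                    perRecord.transform(new DOMSource(piece.getNode()), new StreamResult(out));
                    count++;
                    // The StAX-to-SAX bridge may already have advanced the cursor past
                    // the record's end tag, so re-check the current event instead of
                    // calling next() unconditionally.
                } else {
                    xsr.next();
                }
            }
            xsr.close();
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("burst-mode transformation failed", e);
        }
    }
}
```

Because each record is transformed in isolation, this only suits transformations where a record does not need to see its siblings, which is the same limitation the excerpt describes for burst mode. Carrying a running total across records, as xsl:iterate allows, would need extra state kept in the Java loop.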

  3. XSLT 3.0 will have a standard streaming feature. However, the W3C document still has "working draft" status, and the streaming specification is likely to change in subsequent draft versions. Because of this, no implementations of the current draft (streaming) specification exist.

  4. Warning: Not every transformation can be performed in streaming mode -- regardless of the XSLT processor. One example of a transformation that isn't possible to perform in a streaming mode (with a limited amount of RAM) for huge documents is sorting their elements (say by a common attribute).
