How can I efficiently run XSLT transformations for a large number of files in parallel?


Question


I regularly have to transform a large number of XML files (min. 100K) within a single folder each time (basically, from the unzipped input dataset), and I'd like to learn how to do that as efficiently as possible. My stack consists of XSLT sheets and the Saxon XSLT Java library, called from Bash scripts. It runs on an Ubuntu server with 8 cores, an SSD RAID, and 64 GB of RAM. Keep in mind I handle XSLT well, but I'm still learning Bash and how to distribute the load properly for such tasks (and Java is almost just a word to me at this point too).


I previously created a post about this issue, as my approach seemed very inefficient and I actually needed help just to get it running properly (see this SOF post). Many comments later, it made sense to present the issue differently, hence this post. Several solutions were proposed to me, one of which currently works much better than mine, but it could still be more elegant and efficient.

Right now, I'm running:

printf -- '-s:%s\0' input/*.xml | xargs -P 600 -n 1 -0 java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl


I set 600 processes based on some earlier tests. Going any higher just throws memory errors from Java. But the run now only uses between 30 and 40 GB of RAM (though all 8 cores are at 100%).
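One knob that may be worth trying before anything structural: cap `-P` at the core count instead of 600, since 600 JVMs contending for 8 cores mostly costs memory and context switches. Below is a minimal dry-run sketch of the same pipeline, with `echo` standing in for the real `java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl` call so the plumbing can be checked without the dataset:

```shell
#!/usr/bin/env bash
# Dry-run sketch: same printf | xargs pipeline as above, but with -P
# capped at the core count so each JVM gets a larger share of the RAM.
# 'echo' stands in for: java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl
workdir=$(mktemp -d)
touch "$workdir"/a.xml "$workdir"/b.xml "$workdir"/c.xml  # dummy inputs
printf -- '-s:%s\0' "$workdir"/*.xml \
  | xargs -0 -n 1 -P "$(nproc)" echo \
  | wc -l   # one line per would-be transform
rm -r "$workdir"
```

Whether 8 slower-starting-but-roomier JVMs beat 600 starved ones is an empirical question for this workload, but it is a cheap experiment.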


In a nutshell, here is all the advice/approaches I have so far:

  1. Split the whole set of XML files among subfolders (e.g. 5K files each), and use this to run the transformation scripts in parallel, one per subfolder
  2. Use specifically the Saxon-EE library (which allows multithreaded execution) with the collection() function to parse the XML files
  3. Set up the Java environment with a lower number of tasks, or decrease the memory per process
  4. Tell Saxon whether the XSLT sheets are compatible with libxml/libxslt (isn't that only for XSLT 1.0?)
  5. Use a specialized shell such as xmlsh


I can handle solution #2, and it should directly make it possible to control the loop and load the JVM only once; #1 seems clumsier, and I still need to improve my Bash (load distribution & performance, handling relative/absolute paths); #3, #4 and #5 are totally new to me and I may need more explanation to see how to tackle them.
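For what it's worth, the splitting half of approach #1 is plain Bash and can be tested in isolation. A hedged sketch follows: the chunk size and folder names are illustrative guesses, and the commented-out Saxon loop assumes Saxon's command line accepting a directory for `-s:` and `-o:`, which is what makes one-JVM-per-subfolder workable:

```shell
#!/usr/bin/env bash
# Sketch of approach #1: split input/ into subfolders of up to $chunk
# files each. Chunk size and folder naming are illustrative only.
shopt -s nullglob   # an empty input/ just means zero loop iterations
chunk=5000
i=0
n=0
for f in input/*.xml; do
  if [ $(( n % chunk )) -eq 0 ]; then
    i=$(( i + 1 ))
    mkdir -p "input-$i"
  fi
  mv "$f" "input-$i/"
  n=$(( n + 1 ))
done
# Then one JVM per subfolder, backgrounded and waited on;
# uncomment for the real run:
# for d in input-*; do
#   mkdir -p "output-$d"
#   java -jar saxon9he.jar -s:"$d" -o:"output-$d" -xsl:some-xslt-sheet.xsl &
# done
# wait
```

This amortizes JVM startup over thousands of files per process instead of one, at the cost of the bookkeeping above.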

Any input would be greatly appreciated.

Answer


Try using the xsltproc command line tool from libxslt. It can take multiple XML files as arguments. To call it like that, you'll need to create an output directory first. Try calling it like this:

mkdir output
xsltproc -o output/ some-xslt-sheet.xsl input/*.xml
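Since xsltproc itself is single-threaded, with ~100K files it may still be worth fanning it out across cores with the same `xargs` trick as in the question. A hedged sketch (the 100-file batch size is a guess to tune; the leading `echo` makes it a dry run that works even without libxslt installed, so drop the `echo` for the real transform):

```shell
#!/usr/bin/env bash
# Fan xsltproc out: ~100 files per invocation amortizes process
# startup, and -P caps concurrency at the core count. The leading
# 'echo' turns each batch into a printed command line (dry run);
# remove it to actually transform.
mkdir -p output
printf '%s\0' input/*.xml \
  | xargs -0 -n 100 -P "$(nproc)" echo xsltproc -o output/ some-xslt-sheet.xsl
```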
