XSLT works too slow

Question

I have around 100 XML files that I want to transform into another format with a better structure. This example converts them to CSV, but I also have a variant that transforms them into better XML. The output format is not that relevant for me. I see there are tons of questions like this, but I find the examples hard to adapt, because the problem is not that the stylesheet doesn't work but that it is too slow.

My data files are between 4 and 12 MB each. The XSLT I have provided here works well with small files. For example, when I cut a file down to a 250 KB piece, the stylesheet processes it fine (though this already takes around 30 seconds). When I try it on the actual, larger data files, it never seems to finish the job, not even with one file. I use Oxygen XML Editor, and I've been using Saxon-HE 9.5.1.2 for the transformation.

One remark: the result can still be slowish. I can leave my computer to run it overnight or something. This concerns one malformed dataset, and I don't need to repeat this transformation often at all.

So my questions are:

Is there something in this XSLT that makes it work particularly slowly? Would another approach work better?

These are simplified working examples. The actual data files are structurally identical, but have more nodes which I called "words" in this example. The attribute type specifies which nodes I'm after. It is linguistic dialect data with dialectal words and their normalized versions.

This is the XML.

<?xml version="1.0" encoding="UTF-8"?>
<xml>
<order>
    <slot id="ts1" value="1957"/>
    <slot id="ts2" value="1957"/>
    <slot id="ts3" value="2389"/>
    <slot id="ts4" value="2389"/>
    <slot id="ts5" value="2389"/>
    <slot id="ts6" value="2389"/>
    <slot id="ts7" value="3252"/>
    <slot id="ts8" value="3252"/>
    <slot id="ts9" value="3252"/>
    <slot id="ts10" value="3360"/>
</order>
<words type="original word">
    <annotation>
        <data id_1="ts1" id_2="ts3">
            <text>dialectal_word_1</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts4" id_2="ts7">
            <text>dialectal_word_2</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts8" id_2="ts10">
            <text>,</text>
        </data>
    </annotation>
</words>
<words type="normalized word">
    <annotation>
        <data id_1="ts2" id_2="ts5">
            <text>normalized_word_1</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts6" id_2="ts9">
            <text>normalized_word_2</text>
        </data>
    </annotation>
</words>
</xml>

This is the XSLT. What it attempts to do is to pick up the pairs which have matching values up in the XML structure.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text" encoding="UTF-8" indent="yes"/>
<xsl:template match="/xml">
    <xsl:text>original&#x9;normalized
</xsl:text>
    <xsl:for-each select="words[@type='original word']/annotation/data">
        <xsl:sort select="substring-after(@id_1, 'ts')" data-type="number"/>
        <xsl:variable name="origStartTimeId" select="@id_1"/>
        <xsl:variable name="origEndTimeId" select="@id_2"/>
        <xsl:variable name="origStartTime_VALUE" select="/xml/order/slot[@id=$origStartTimeId]/@value"/>
        <xsl:variable name="origEndTime_VALUE" select="/xml/order/slot[@id=$origEndTimeId]/@value"/>
        <xsl:value-of select="text"/>
        <xsl:text>&#x9;</xsl:text>
        <xsl:for-each select="/xml/words[@type='normalized word']/annotation/data">
            <xsl:variable name="normStartTime" select="@id_1"/>
            <xsl:variable name="normEndTime" select="@id_2"/>
            <xsl:variable name="normStartTime_VALUE" select="/xml/order/slot[@id=$normStartTime]/@value"/>
            <xsl:variable name="normEndTime_VALUE" select="/xml/order/slot[@id=$normEndTime]/@value"/>
            <xsl:if test="($normStartTime_VALUE = $origStartTime_VALUE) and ($normEndTime_VALUE = $origEndTime_VALUE)">
                <xsl:value-of select="text"/>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>
</xsl:text>
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

The output looks like this:

original    normalized
dialectal_word_1    normalized_word_1
dialectal_word_2    normalized_word_2
,   

That is fine for me.

Thanks!

Recommended answer

The double nested for-each in your current stylesheet is inefficient and will get worse as the size of the file grows - you've got (number of original words)*(number of normalized words) iterations, essentially quadratic complexity (assuming there's roughly the same number of original and normalized words in the file). You can do much better if you use keys, which work by building a lookup table that you can use to find nodes very quickly (typically in constant rather than linear time).

<!-- I've said version="2.0" to match your stylesheet in the question, but this
     code is actually valid XSLT 1.0 as it doesn't use any 2.0-specific features
     or functions -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" encoding="UTF-8" indent="yes"/>

  <!-- first key to look up slot elements by their id -->
  <xsl:key name="slotById" match="slot" use="@id" />
  <!-- second key to look up normalized word annotations by the value of their slots -->
  <xsl:key name="annotationBySlots" match="words[@type='normalized word']/annotation"
           use="concat(key('slotById', data/@id_1)/@value, '|',
                       key('slotById', data/@id_2)/@value)" />

  <xsl:template match="/xml">
    <xsl:text>original&#x9;normalized&#xA;</xsl:text>
    <xsl:apply-templates select="words[@type = 'original word']/annotation" />
  </xsl:template>

  <xsl:template match="annotation">
    <xsl:value-of select="data/text" />
    <xsl:text>&#x9;</xsl:text>
    <xsl:value-of select="
            key('annotationBySlots',
                concat(key('slotById', data/@id_1)/@value, '|',
                       key('slotById', data/@id_2)/@value)
            )/data/text" />
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

This should run in linear time (one "iteration" per original word annotation, plus the time taken to build the lookup tables which again should be linear in the number of slots plus the number of normalized word annotations).
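
The question mentions a variant that produces XML rather than CSV. As a rough sketch only, the same two keys could drive that variant as shown below. The output element names (pairs, pair, original, normalized) are hypothetical placeholders that do not come from the question, and the xsl:sort is carried over from the question's stylesheet on the assumption that the numeric order of the ts ids matters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- same keys as in the CSV stylesheet above -->
  <xsl:key name="slotById" match="slot" use="@id"/>
  <xsl:key name="annotationBySlots" match="words[@type='normalized word']/annotation"
           use="concat(key('slotById', data/@id_1)/@value, '|',
                       key('slotById', data/@id_2)/@value)"/>

  <xsl:template match="/xml">
    <!-- "pairs" is a hypothetical wrapper element -->
    <pairs>
      <xsl:apply-templates select="words[@type = 'original word']/annotation">
        <!-- sort taken from the question; drop it if document order is enough -->
        <xsl:sort select="substring-after(data/@id_1, 'ts')" data-type="number"/>
      </xsl:apply-templates>
    </pairs>
  </xsl:template>

  <xsl:template match="annotation">
    <!-- build the "start|end" lookup value once per original word -->
    <xsl:variable name="slotValues"
                  select="concat(key('slotById', data/@id_1)/@value, '|',
                                 key('slotById', data/@id_2)/@value)"/>
    <pair>
      <original><xsl:value-of select="data/text"/></original>
      <!-- one "normalized" element per matching annotation, none if nothing matches -->
      <xsl:for-each select="key('annotationBySlots', $slotValues)">
        <normalized><xsl:value-of select="data/text"/></normalized>
      </xsl:for-each>
    </pair>
  </xsl:template>
</xsl:stylesheet>
```

With the sample input from the question, the comma annotation has no normalized counterpart, so its pair element would contain only an original child. The sort adds an n log n step on top of the otherwise linear lookups, which should still be far cheaper than the nested loops.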
