Tokenize mixed content in XSLT

Views: 37 | Published: 2021/9/8 20:23:20 | Tags: xml xslt xslt-2.0 tokenize

Question

I have an element <mixed> that contains mixed content. Is it possible to use XSLT (2.0) to wrap all "words" (delimited by the pattern \s+, for example) inside <mixed> in a <w> tag, descending into inline elements when necessary? For example, given the following input:

<mixed>
  One morning, when <a>Gregor Samsa</a>
  woke from troubled dreams, he found
  himself transformed in his bed into
  a <b><c>horrible vermin</c></b>.
</mixed>

I want something like the following output:

<mixed>
  <w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
  <w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
  <w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
  <w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>
</mixed>

Dimitre Novatchev provided a template in an answer to this related question that goes much of the way to solving this, but does not satisfy the following requirements:

Inline elements that terminate within a "word" should be split so that a single <w> element contains the whole "word." Otherwise there would be invalid XML, such as:

  <w>a</w> <w><b><c>horrible</w> <w>vermin</c></b>.</w>

However, this template detaches the punctuation . after vermin and produces:

  <w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b> <w>.</w>

(None of the current 3 answers satisfies this requirement.)

The split token must not be discarded. Consider the similar task of wrapping non-coefficient numbers in <sub> tags in the context of a chemical formula. For example, <reactants>2H2 + O2</reactants> becomes <reactants>2H<sub>2</sub> + O<sub>2</sub></reactants>. This is not possible using the tokenize function because it simply discards the separator. Instead we will probably have to fall back on analyze-string.
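
To illustrate why analyze-string fits where tokenize does not, here is a minimal sketch for the chemical-formula case. It assumes the formula sits in a single text node directly under <reactants> and treats any run of digits that immediately follows a letter as a subscript; both are simplifying assumptions, not part of the original question. Because xsl:analyze-string hands back the matching and the non-matching substrings, nothing from the input is dropped.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>

  <!-- identity transform: copy everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- assumption: the whole formula is one text node inside <reactants> -->
  <xsl:template match="reactants/text()">
    <!-- group 1: the letter before the digits; group 2: the digits -->
    <xsl:analyze-string select="." regex="([A-Za-z])([0-9]+)">
      <xsl:matching-substring>
        <!-- re-emit the letter, then wrap only the digits -->
        <xsl:value-of select="regex-group(1)"/>
        <sub>
          <xsl:value-of select="regex-group(2)"/>
        </sub>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- coefficients, operators and whitespace pass through unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to <reactants>2H2 + O2</reactants>, the leading coefficient 2 falls in a non-matching substring and is copied as-is, while H2 and O2 match and have their digits wrapped. The letter is captured in a group because XSLT 2.0 regular expressions have no lookbehind. Digits that follow a closing parenthesis, as in Ca(OH)2, are not handled by this sketch.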

If not XSLT, what is the best method to do this?

Recommended answer

AFAICT, this would provide the expected result in your example:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="text()[ancestor::mixed]">
    <xsl:analyze-string select="." regex="\s+">
        <xsl:matching-substring>
            <xsl:value-of select="." />
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <w>
                <xsl:value-of select="." />
            </w>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

However, I did not understand your point regarding "Inline elements that terminate within a 'word'". What would be the expected result when, for example, part of a word is italicized?
