Convert OOXML inline formatting to a merged element

Question

In OOXML, formatting such as bold and italic can be (and, annoyingly, often is) split up across multiple elements, like so:

<w:p>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is a </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">bold </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t>with a bit of italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>paragr</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>a</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>ph</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
    </w:r>
</w:p>

I need to combine these formatting elements to produce this:

<p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p>

My initial approach was going to be to write out the start formatting tag when it is first encountered, using:

 <xsl:text disable-output-escaping="yes">&lt;b&gt;</xsl:text>

Then, after processing each <w:r>, I would check the next one to see whether the formatting is still present. If it is not, I would add the end tag in the same way I add the start tag. I keep thinking there must be a better way to do this, and I'd be grateful for any suggestions.

I should also mention that I am tied to XSLT 1.0.

The reason for needing this is that we need to compare an XML file before it is transformed into OOXML with the same file after it is transformed back out of OOXML. The extra formatting tags make it appear as though changes were made when they were not.

Recommended answer

Here is a complete XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common" xmlns:w="w"
 exclude-result-prefixes="ext w">
 <xsl:output omit-xml-declaration="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="w:p">
  <xsl:variable name="vrtfPass1">
   <p>
    <xsl:apply-templates/>
   </p>
  </xsl:variable>

  <xsl:apply-templates mode="pass2"
   select="ext:node-set($vrtfPass1)/*"/>
 </xsl:template>

 <xsl:template match="w:r">
  <xsl:variable name="vrtfProps">
   <xsl:for-each select="w:rPr/*">
    <xsl:sort select="local-name()"/>
    <xsl:copy-of select="."/>
   </xsl:for-each>
  </xsl:variable>

  <xsl:call-template name="toHtml">
   <xsl:with-param name="pProps" select=
       "ext:node-set($vrtfProps)/*"/>
   <xsl:with-param name="pText" select="w:t/text()"/>
  </xsl:call-template>
 </xsl:template>

 <xsl:template name="toHtml">
  <xsl:param name="pProps"/>
  <xsl:param name="pText"/>

  <xsl:choose>
   <xsl:when test="not($pProps)">
     <xsl:copy-of select="$pText"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:element name="{local-name($pProps[1])}">
      <xsl:call-template name="toHtml">
        <xsl:with-param name="pProps" select=
            "$pProps[position()>1]"/>
        <xsl:with-param name="pText" select="$pText"/>
      </xsl:call-template>
    </xsl:element>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

 <xsl:template match="/*" mode="pass2">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:call-template name="processInner">
     <xsl:with-param name="pNodes" select="node()"/>
    </xsl:call-template>
  </xsl:copy>
 </xsl:template>

 <xsl:template name="processInner">
  <xsl:param name="pNodes"/>

  <xsl:variable name="pNode1" select="$pNodes[1]"/>

  <xsl:if test="$pNode1">
   <xsl:choose>
    <xsl:when test="not($pNode1/self::*)">
     <xsl:copy-of select="$pNode1"/>
     <xsl:call-template name="processInner">
      <xsl:with-param name="pNodes" select=
      "$pNodes[position()>1]"/>
     </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="vbatchLength">
        <xsl:call-template name="getBatchLength">
         <xsl:with-param name="pNodes"
              select="$pNodes[position()>1]"/>
         <xsl:with-param name="pName"
             select="name($pNode1)"/>
         <xsl:with-param name="pCount" select="1"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:element name="{name($pNode1)}">
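        <!-- xsl:call-template keeps the caller's context node, so the
             attributes are taken from $pNode1 explicitly -->
        <xsl:copy-of select="$pNode1/@*"/>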

        <xsl:call-template name="processInner">
         <xsl:with-param name="pNodes" select=
         "$pNodes[not(position()>$vbatchLength)]
                        /node()"/>
        </xsl:call-template>
      </xsl:element>

      <xsl:call-template name="processInner">
       <xsl:with-param name="pNodes" select=
       "$pNodes[position()>$vbatchLength]"/>
      </xsl:call-template>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:if>
 </xsl:template>

 <xsl:template name="getBatchLength">
  <xsl:param name="pNodes"/>
  <xsl:param name="pName"/>
  <xsl:param name="pCount"/>

  <xsl:choose>
   <xsl:when test=
   "not($pNodes) or not($pNodes[1]/self::*)
    or not(name($pNodes[1])=$pName)">
   <xsl:value-of select="$pCount"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:call-template name="getBatchLength">
     <xsl:with-param name="pNodes" select=
         "$pNodes[position()>1]"/>
     <xsl:with-param name="pName" select="$pName"/>
     <xsl:with-param name="pCount" select="$pCount+1"/>
    </xsl:call-template>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document (based on the one provided, but made more complicated to show that more edge cases are covered):

<w:p xmlns:w="w">
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is a </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">bold </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t>with a bit of italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t> and some more italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:i/>
        </w:rPr>
        <w:t> and just italic, no-bold</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"></w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>paragr</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>a</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>ph</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
    </w:r>
</w:p>

the wanted, correct result is produced:

<p><b>This is a bold <i>with a bit of italic and some more italic</i></b><i> and just italic, no-bold</i><b>paragraph</b> with some non-bold in it too.</p>

Explanation:

  1. This is a two-pass transformation. The first pass is relatively simple and converts the source XML document (in our specific case) to the following:

Pass 1 result (indented for readability):

<p>
   <b>This is a </b>
   <b>bold </b>
   <b>
      <i>with a bit of italic</i>
   </b>
   <b>
      <i> and some more italic</i>
   </b>
   <i> and just italic, no-bold</i>
   <b/>
   <b>paragr</b>
   <b>a</b>
   <b>ph</b> with some non-bold in it too.</p>
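
The first pass described above can be modelled outside XSLT. The following is a minimal Python sketch, not part of the original answer: it sorts a run's property names and nests one wrapper element per property around the run's text. The helper names `local_name` and `run_to_nested`, and the use of `xml.etree.ElementTree`, are assumptions made for illustration; the namespace URI `w` matches the sample documents here, not real OOXML files.

```python
import xml.etree.ElementTree as ET

W = "w"  # namespace URI used by the sample document (an assumption; real OOXML uses a longer URI)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def run_to_nested(run):
    """Model of pass 1: wrap a run's text in one element per property, sorted by name.

    Returns an Element, or a plain string when the run has no properties.
    """
    props = sorted(local_name(p.tag) for p in run.findall(f"{{{W}}}rPr/*"))
    text = "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))
    if not props:
        return text
    outer = ET.Element(props[0])
    inner = outer
    for name in props[1:]:
        inner = ET.SubElement(inner, name)
    inner.text = text
    return outer


# The properties are deliberately listed as i, b to show the effect of sorting.
run = ET.fromstring(
    '<w:r xmlns:w="w"><w:rPr><w:i/><w:b/></w:rPr>'
    '<w:t>with a bit of italic</w:t></w:r>'
)
nested = run_to_nested(run)
print(ET.tostring(nested, encoding="unicode"))
```

Sorting the property names is what makes the later merge possible: a run marked bold+italic and a run marked italic+bold both become `<b><i>…</i></b>`, so adjacent runs nest in the same order.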

  2. The second pass (executed in mode "pass2") merges any batch of consecutive, identically named elements into a single element with that name. It calls itself recursively on the children of the merged elements, so batches at any depth are merged.

  3. Note that we do not (and cannot) use the following-sibling:: or preceding-sibling:: axes, because only the nodes to be merged at the top level are really siblings; deeper down, the children of a merged batch come from different parents. For this reason we process all nodes simply as a node-set.

  4. This solution is completely generic: it merges any batch of consecutive, identically named elements at any depth, and no specific names are hardcoded.
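
To make the second pass concrete, here is a minimal Python sketch of the merging idea, offered as an illustration and not as a translation of the stylesheet. It assumes a simplified node model in which a node is either a string (text) or a `(name, children)` tuple, and it uses `itertools.groupby` where the XSLT uses the recursive `getBatchLength` template. The function names `merge_batches` and `serialize` are hypothetical.

```python
from itertools import groupby


def merge_batches(nodes):
    """Model of pass 2. A node is a str (text) or a (name, children) tuple.

    Consecutive elements with the same name are merged into one element, and
    the pooled children of the batch are merged recursively.
    """
    merged = []
    # Group adjacent nodes: text nodes get key None, elements are keyed by name.
    for name, batch in groupby(
        nodes, key=lambda n: None if isinstance(n, str) else n[0]
    ):
        batch = list(batch)
        if name is None:
            merged.extend(batch)  # text is copied through unchanged
        else:
            # The pooled children come from different parents, so they are
            # handled as a plain list and not via sibling relationships.
            pooled = [child for _, children in batch for child in children]
            merged.append((name, merge_batches(pooled)))
    return merged


def serialize(nodes):
    """Render the node list as markup (no escaping; enough for this demo)."""
    return "".join(
        n if isinstance(n, str) else f"<{n[0]}>{serialize(n[1])}</{n[0]}>"
        for n in nodes
    )


# The pass 1 result shown earlier, written as nested tuples.
pass1 = [
    ("b", ["This is a "]),
    ("b", ["bold "]),
    ("b", [("i", ["with a bit of italic"])]),
    ("b", [("i", [" and some more italic"])]),
    ("i", [" and just italic, no-bold"]),
    ("b", []),
    ("b", ["paragr"]),
    ("b", ["a"]),
    ("b", ["ph"]),
    " with some non-bold in it too.",
]

print("<p>" + serialize(merge_batches(pass1)) + "</p>")
```

Because the grouping key is just the element name, nothing in the sketch mentions `b` or `i`; any other element name would be merged the same way, which mirrors the genericity claimed in point 4.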
