Closing tags when extracting HTML from XML


Question

I am transforming a mixed HTML and XML document using an XSLT stylesheet and extracting only the HTML elements.

Source file:

<?xml version="1.0" encoding="utf-8" ?>
<html >
  <head>
    <title>Simplified Example Form</title>
  </head>
  <body>
    <TLA:document xmlns:TLA="http://www.TLA.com">
      <TLA:contexts>
        <TLA:context id="id_1" value=""></TLA:context>
      </TLA:contexts>
      <table id="table_logo" style="display:inline">
        <tr>
          <td height="20" align="middle">Big Title Goes Here</td>
        </tr>
        <tr>
          <td align="center">
            <img src="logo.jpg" border="0"></img>
          </td>
        </tr>
      </table>
      <TLA:page>
        <TLA:question id="q_id_1">
          <table id="table_id_1">
            <tr>
              <td>Label text goes here</td>
              <td>
                <input id="input_id_1" type="text"></input>
              </td>
            </tr>
          </table>
        </TLA:question>
      </TLA:page>
      <!-- Repeat many times -->
    </TLA:document>
  </body>
</html>

Stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:TLA="http://www.TLA.com" exclude-result-prefixes="TLA">
  <xsl:output method="html" indent="yes" version="4.0" />
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->
  <xsl:template match="TLA:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

Output:

<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Simplified Example Form</title>
  </head>
  <body>
    <table id="table_logo" style="display:inline">
      <tr>
        <td height="20" align="middle">Big Title Goes Here</td>
      </tr>
      <tr>
        <td align="center"><img src="logo.jpg" border="0"></td>
      </tr>
    </table>
    <table id="table_id_1">
      <tr>
        <td>Label text goes here</td>
        <td><input id="input_id_1" type="text"></td>
      </tr>
    </table>
  </body>
</html>

However, the meta, img, and input elements are not being closed. I've set the xsl:output method to html and the version to 4.0, so as far as I know the result should be correct HTML.

I'm guessing that a subtle change is needed in the first xsl:template/xsl:copy instruction, but my XSLT skills are highly limited.

What change needs to be made to get the tags to close correctly?

P.S. I'm not sure whether different tools/parsers behave differently, but I'm using Visual Studio 2012 to debug the stylesheet so that I can see the immediate effect of any changes.

Answer

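The <meta>, <img> and <input> elements don't need to be closed; the output is still valid HTML. In HTML 4 these are empty elements that take no end tag, which is why the html output method writes them that way.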
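The same split between HTML and XML serialization shows up outside XSLT. As a rough analogy only (this uses Python's standard `xml.etree.ElementTree`, not the XSLT processor from the question, and the fragment is a trimmed copy of the source document), the sketch below serializes one tree with both methods:

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the source document from the question (an assumption
# made for brevity; the real document is much larger).
fragment = (
    '<tr>'
    '<td><img src="logo.jpg" border="0"></img></td>'
    '<td><input id="input_id_1" type="text"></input></td>'
    '</tr>'
)

root = ET.fromstring(fragment)

# HTML serialization: empty elements such as img and input get no end tag.
as_html = ET.tostring(root, encoding="unicode", method="html")

# XML serialization: elements without content are written in self-closing form.
as_xml = ET.tostring(root, encoding="unicode", method="xml")

print(as_html)
print(as_xml)
```

The first line printed leaves img and input unclosed while still closing td and tr; the second writes them in self-closing form, which mirrors what switching xsl:output from html to xml does.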

If you want to have them closed, you could use xml as the output method (with XSLT 2.0 you could use xhtml, too, as far as I know). The xml method does not generate a <meta> tag for you, so add it yourself if you need it. For example:

Stylesheet

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:TLA="http://www.TLA.com" exclude-result-prefixes="TLA">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="head">
    <xsl:copy>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->
  <xsl:template match="TLA:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

Output

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Simplified Example Form</title>
  </head>
  <body>
    <table id="table_logo" style="display:inline">
      <tr>
        <td height="20" align="middle">Big Title Goes Here</td>
      </tr>
      <tr>
        <td align="center">
          <img src="logo.jpg" border="0"/>
        </td>
      </tr>
    </table>
    <table id="table_id_1">
      <tr>
        <td>Label text goes here</td>
        <td>
          <input id="input_id_1" type="text"/>
        </td>
      </tr>
    </table>
  </body>
</html>
