Copy only HTML from mixed XML and HTML
Question
We have a bunch of files that are html pages but which contain additional xml elements (all prefixed with our company name 'TLA') to provide data and structure for an older program which I am now rewriting.
Example form:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
<TLA:document xmlns:TLA="http://www.tla.com">
<TLA:contexts>
<TLA:context id="id_1" value=""></TLA:context>
</TLA:contexts>
<TLA:page>
<TLA:question id="q_id_1">
<table>
<tr>
<td>
<input id="input_id_1" type="text" />
</td>
</tr>
</table>
</TLA:question>
</TLA:page>
<!-- Repeat many times -->
</TLA:document>
</body>
</html>
My task is to write a pre-processor that will copy only the html elements, complete with their attributes and content into a new file.
Like so:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
<table>
<tr>
<td>
<input id="input_id_1" type="text" />
</td>
</tr>
</table>
<!-- Repeat many times -->
</body>
</html>
I've taken the approach of using XSLT as that was what I needed to extract the TLA elements for a different file. So far this is the XSLT I have:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
xmlns:mbl="http://www.mbl.com">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*" />
<xsl:template match="mbl:* | mbl:*/@* | mbl:*/text()"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
However this only produces the following:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
</body>
</html>
As you can see everything within the TLA:document element is excluded. What needs to be changed in the XSLT to get all the html but filter out the TLA elements?
Alternatively, is there a simpler way to go about this? I know that virtually every browser will ignore the TLA elements so is there a way to get what I need using an HTML tool or app?
Recommended answer
Specifically targeting HTML elements would be hard, but if you just want to exclude content from the TLA namespace (but still include any non-TLA elements that the TLA elements contain), then this should work:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Low-priority identity template: copies attributes, text, comments
       and processing instructions unchanged -->
  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements.
       select="*" means text and comments directly inside a TLA element are dropped -->
  <xsl:template match="mbl:*">
    <xsl:apply-templates select="*"/>
  </xsl:template>
</xsl:stylesheet>
You can also use this instead if you want to exclude anything that has any non-null namespace:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Low-priority identity template for attributes and non-element nodes -->
  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rebuild elements by name so no namespace declarations are copied -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Skip any element in a non-null namespace, but keep processing its child elements -->
  <xsl:template match="*[namespace-uri()]">
    <xsl:apply-templates select="*"/>
  </xsl:template>
</xsl:stylesheet>
When either is run on your sample input, the result is:
<html>
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
<table>
<tr>
<td>
<input id="input_id_1" type="text" />
</td>
</tr>
</table>
</body>
</html>
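On the "simpler way" part of the question: if XSLT is not a hard requirement, the same idea (drop the namespaced wrapper elements but keep the plain elements nested inside them) can be sketched in Python using only the standard library. Treat this as a rough sketch rather than a drop-in replacement for the stylesheets. The function name `unwrap_namespaced` and the inlined `SAMPLE` string are made up for illustration. It assumes the files parse as well-formed XML and that the root element is not itself namespaced. Text directly inside a removed element, and text immediately following it, is discarded, and the default ElementTree parser drops comments.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the sample document in the question.
TLA_NS = "http://www.tla.com"

# Shortened copy of the sample form (hypothetical test input).
SAMPLE = """<html>
<head><title>Highly Simplified Example Form</title></head>
<body>
<TLA:document xmlns:TLA="http://www.tla.com">
<TLA:contexts><TLA:context id="id_1" value=""></TLA:context></TLA:contexts>
<TLA:page>
<TLA:question id="q_id_1">
<table><tr><td><input id="input_id_1" type="text" /></td></tr></table>
</TLA:question>
</TLA:page>
</TLA:document>
</body>
</html>"""


def unwrap_namespaced(parent, ns_uri=None):
    """Replace each namespaced child of `parent` with that child's children.

    With ns_uri=None every element in a non-null namespace is unwrapped;
    otherwise only elements in the given namespace are.
    """
    # ElementTree stores qualified tags as "{namespace-uri}localname".
    prefix = "{" if ns_uri is None else "{" + ns_uri + "}"
    i = 0
    while i < len(parent):
        child = parent[i]
        if isinstance(child.tag, str) and child.tag.startswith(prefix):
            # Splice the grandchildren into the position the child occupied.
            grandchildren = list(child)
            parent.remove(child)
            for offset, grandchild in enumerate(grandchildren):
                parent.insert(i + offset, grandchild)
            # Do not advance i: the spliced-in nodes still need checking.
        else:
            unwrap_namespaced(child, ns_uri)
            i += 1


root = ET.fromstring(SAMPLE)
unwrap_namespaced(root, TLA_NS)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Calling `unwrap_namespaced(root)` without a namespace URI corresponds to the second stylesheet: it unwraps every element that has any namespace at all.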