Can I parse HTML using XSLT?

Problem description

I have to parse a big HTML file, and I'm only interested in a small section (a table). So I thought about using XSLT to simplify/transform the HTML into something simpler that I could then easily process.

The problem I'm having is that the stylesheet is not finding my table. So I don't know if it's even possible to parse HTML using an XSL stylesheet.

By the way, the HTML file has this look (schematic, missing tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html id="ctl00_htmlDocumento" xmlns="http://www.w3.org/1999/xhtml" lang="es-ES" xml:lang="es-ES">
<div> some content </div>
<div class="NON_IMPORTANT"></div>
<div class="IMPORTANT_FATHER">
    <div class="IMPORTANT">
        <table>
            HERE IS THE DATA IM LOOKING FOR
        </table>
    </div>
</div>

As requested, here is my XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="tbody">
        tbody found, let's process it
        <xsl:for-each select="tr">
            new tr found, let's process it
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

The full HTML is quite big, so I don't know how to present it here... I've validated the document in Oxygen, and it says it's valid.

Thanks in advance. Gonso

Solution

You're not using XPath correctly in your match attributes. The document puts its elements in the XHTML namespace, so you need to declare it with an xmlns:xhtml="http://www.w3.org/1999/xhtml" attribute on your xsl:stylesheet element and then use the xhtml: prefix in your XPath expressions. The prefix is required: XPath 1.0 ignores the default namespace, so an unprefixed name such as tbody only matches elements in no namespace.
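
Applied to the stylesheet from the question, that change might look like the sketch below. It assumes the real document contains explicit tbody and tr elements (the schematic HTML in the question does not show them) and that the file is well-formed XML, which an XSLT processor requires.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Matches tbody elements in the XHTML namespace. An unprefixed
       "tbody" would only match elements that are in no namespace. -->
  <xsl:template match="xhtml:tbody">
    tbody found, let's process it
    <!-- The relative path in select needs the prefix as well. -->
    <xsl:for-each select="xhtml:tr">
      new tr found, let's process it
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The prefix name itself is arbitrary; only the namespace URI has to agree with the one declared on the html element of the input.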

After this, you'll still have the problem that everything else gets processed too: XSLT's built-in template rules apply to every node you don't match yourself and copy its text to the output. I don't know if there's a better solution to this, but I think you will need to explicitly process the elements on the path to the tbody element, something like

<xsl:template match="xhtml:html">
  <xsl:apply-templates select="xhtml:body"/>
</xsl:template>

and the same thing for body and so on until you get to your tbody match.
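
Spelled out in full, the chain of templates could look roughly like this. The path used here (html, body, the two div elements, table, tbody) is an assumption based on the schematic HTML in the question; the real document will probably need different or additional steps.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Each template forwards processing to one child only, so the
       built-in rules never see the rest of the document. -->
  <xsl:template match="xhtml:html">
    <xsl:apply-templates select="xhtml:body"/>
  </xsl:template>

  <xsl:template match="xhtml:body">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT_FATHER']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT_FATHER']">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT']">
    <!-- Assumes the table has an explicit tbody child. -->
    <xsl:apply-templates select="xhtml:table/xhtml:tbody"/>
  </xsl:template>

  <!-- End of the chain: write the text of each row on its own line. -->
  <xsl:template match="xhtml:tbody">
    <xsl:for-each select="xhtml:tr">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Note that @class='IMPORTANT' is an exact string comparison, so it would not match an element whose class attribute lists several class names.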

XPath also supports more complex matching than just a specific child as above. For instance, matching the third child div tag can be done with

<xsl:template match="xhtml:div[3]">

and matching an element with a specific attribute with

<xsl:template match="xhtml:div[@class='IMPORTANT']">

Here the [] surrounds an additional condition (a predicate) that must hold for the element to be considered a match. A plain number selects by position, so div[3] matches a div that is the third div child of its parent (indexing is 1-based), and an @ sign precedes an attribute name. You can put arbitrarily complex XPath in a predicate, so you can match pretty much any substructure you'd like.
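
As a rough illustration of combining these predicates, the sketch below treats the first row of the table as a header and prints only later rows that contain td cells. The row layout (a header row followed by data rows) is purely an assumption for the example, and the root template selects the rows directly instead of walking the path step by step.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Start at the document root and hand over only the rows of the
       table inside the IMPORTANT div. -->
  <xsl:template match="/">
    <xsl:apply-templates
        select="//xhtml:div[@class='IMPORTANT']/xhtml:table//xhtml:tr"/>
  </xsl:template>

  <!-- Positional predicate: a tr that is the first tr child of its parent. -->
  <xsl:template match="xhtml:tr[1]">
    <xsl:text>header: </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Compound predicate: a later row with at least one td child. -->
  <xsl:template match="xhtml:tr[position() &gt; 1 and xhtml:td]">
    <xsl:text>row: </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Fallback with lower default priority: ignore any other row. -->
  <xsl:template match="xhtml:tr"/>

</xsl:stylesheet>
```

The two predicate templates are written so that they can never match the same row, which avoids a conflict between rules of equal priority; the bare xhtml:tr pattern has a lower default priority and only applies when neither of them does.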
