Can I parse HTML using XSLT?


Problem description

I have to parse a big HTML file, and I'm only interested in a small section of it (a table). So I thought about using XSLT to transform the HTML into something simpler that I could then easily process.

The problem I'm having is that the stylesheet is not finding my table. So I don't know if it's even possible to parse HTML using an XSL stylesheet.

By the way, the HTML file has this look (schematic, missing tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html id="ctl00_htmlDocumento" xmlns="http://www.w3.org/1999/xhtml" lang="es-ES" xml:lang="es-ES">
<div> some content </div>
<div class="NON_IMPORTANT"></div>
<div class="IMPORTANT_FATHER">
    <div class="IMPORTANT">
        <table>
            HERE IS THE DATA IM LOOKING FOR
        </table>
    </div>
</div>

As requested, here is my XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="tbody">
        tbody found, lets process it
        <xsl:for-each select="tr">
            new tr found, lets process it
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

The full HTML is quite big, so I don't know how to present it here. I've validated the document in Oxygen, and it says it's valid.

Thanks in advance. Gonso

Solution

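You're not using XPath correctly in your match attributes. The document puts every element in the XHTML namespace through its default namespace declaration (xmlns="http://www.w3.org/1999/xhtml"), so you need the xmlns:xhtml="http://www.w3.org/1999/xhtml" attribute on your xsl:stylesheet element, and then you'll need to use the xhtml: prefix in your XPath expressions. The prefix is required: in XPath 1.0 an unprefixed name such as tbody matches only elements in no namespace, whatever default namespace is declared.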
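As a minimal sketch of that change, the stylesheet from the question could be adjusted as shown here. It assumes the table really contains tbody and tr elements in the XHTML namespace (an XML parser does not add an implied tbody the way a browser does). The text output method and the "found" messages are illustrative choices, not part of the original answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Assumption: plain text output is enough for a first check. -->
  <xsl:output method="text"/>

  <!-- The xhtml: prefix makes this match tbody in the XHTML namespace.
       An unprefixed "tbody" would match only no-namespace elements. -->
  <xsl:template match="xhtml:tbody">
    <xsl:text>tbody found&#10;</xsl:text>
    <!-- Child steps need the prefix as well. -->
    <xsl:for-each select="xhtml:tr">
      <xsl:text>tr found: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to equal the one declared in the HTML file.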

After this, you'll still have the problem that everything else is processed too: the built-in template rules visit every element you have no template for and copy its text to the output. I don't know if there's a better solution to this, but I think you will need to explicitly process things on the path to the tbody element, something like

<xsl:template match="xhtml:html">
  <xsl:apply-templates select="xhtml:body"/>
</xsl:template>

and the same thing for body and so on until you get to your tbody match.
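Spelled out for the schematic HTML in the question, the chain of templates might look like the sketch below. It assumes that the IMPORTANT_FATHER div is a direct child of body and that the table contains a tbody; neither is visible in the abbreviated sample, so adjust the steps to the real document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Each template forwards only to the next step on the path,
       so sibling content never reaches the built-in rules. -->
  <xsl:template match="xhtml:html">
    <xsl:apply-templates select="xhtml:body"/>
  </xsl:template>

  <!-- Assumption: the wrapper div sits directly under body. -->
  <xsl:template match="xhtml:body">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT_FATHER']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT_FATHER']">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT']">
    <xsl:apply-templates select="xhtml:table/xhtml:tbody"/>
  </xsl:template>

  <!-- End of the path: emit one line of text per table row. -->
  <xsl:template match="xhtml:tbody">
    <xsl:for-each select="xhtml:tr">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```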

XPath also supports more complex matching than just a specific child as above. For instance, matching a div that is the third div child of its parent can be done with

<xsl:template match="xhtml:div[3]">

and matching an element with a specific attribute with

<xsl:template match="xhtml:div[@class='IMPORTANT']">

Here the [] surrounds an additional condition (a predicate) that must be fulfilled for the element to be considered a match. A plain number selects by position among the matches, keeping only the one at that index (indexing is 1-based). An @ sign precedes an attribute name. You can put arbitrarily complex XPath in there, so you can match pretty much any substructure you'd like.
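Predicates can also be combined with longer paths. The sketch below is one possible way to jump straight to the rows of the wanted table from the root template, instead of walking the path element by element. It assumes the class names from the question and that data rows contain td cells; the comma-separated output is purely illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Overriding the root template means nothing is processed unless
       it is selected here. [xhtml:td] keeps only rows that have at
       least one td child, which skips header-only rows. -->
  <xsl:template match="/">
    <xsl:apply-templates
        select="//xhtml:div[@class='IMPORTANT']/xhtml:table//xhtml:tr[xhtml:td]"/>
  </xsl:template>

  <!-- Writes the cells of one row as a comma-separated line. -->
  <xsl:template match="xhtml:tr">
    <xsl:for-each select="xhtml:td">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:if test="position() != last()">
        <xsl:text>,</xsl:text>
      </xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Note that @class='IMPORTANT' is an exact string comparison, so a div whose class attribute lists several class names would not match this predicate.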
