Finding a node (or close to it) using XPath in non-well-formed HTML

Problem description

I'm using XPath to locate a node (or something close to it) in a template that contains non-well-formed HTML nested about 10 levels deep. (No, I didn't write this HTML, but I've been tasked with digging through it.)

I can retrieve an XPath to the element in question using the XPartner add-on for Firefox; however, it only gives me the location in the live site, not in the template I've been given. (The template is written in a non-standard server-side scripting language; read: a language built in-house.)

Are there any XPath tools you know of that are particularly good at muddling through non-well-formed HTML?

Solution

XPath expressions cannot be evaluated against a document that is not well-formed XML, which is exactly the case described.

It is possible to do this in two chained steps: the first converts the HTML to well-formed XML, and the second applies the XPath expression.

Therefore, the question could be stated more precisely as "How to convert HTML to XML so that XPath expressions can be evaluated against it".
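The two chained steps can be sketched in Python using only the standard library. This is an illustration of the idea, not one of the tools named in this answer: the `SoupToTree` class, the `html_to_tree` function, the `root` wrapper element, and the sample markup are all made up for the example, the repair logic is far cruder than a real tag-soup parser, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Step 1: build a well-formed element tree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # an end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    parser = SoupToTree()
    parser.feed(html)
    parser.close()
    return parser.root


# Unquoted attribute, upper-case tag, unclosed <p> and <b>, bare <br>.
SAMPLE = ('<DIV id=main><p>First<p>Second <b>bold<br></DIV>'
          '<span class="x">tail</span>')

tree = html_to_tree(SAMPLE)
# Step 2: evaluate an XPath expression against the repaired tree.
hits = tree.findall(".//span[@class='x']")
print(ET.tostring(tree, encoding="unicode"))
print([e.text for e in hits])
```

Unlike a real HTML-repair tool, this sketch does not apply HTML's implied-end-tag rules (the second `<p>` ends up nested inside the first), so paths computed against its output can differ from those a browser would report.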

Here are three good tools:

1. TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan (http://home.ccil.org/~cowan/). This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML. Taggle is a commercial C++ port of TagSoup.

2. SgmlReader is a tool developed by Microsoft's Chris Lovett. SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided, which outputs the well-formed XML result. Download the zip file including the standalone executable and the full source code: SgmlReader.zip

3. The pure XSLT 2.0 parser of HTML, written by David Carlisle. Reading its code would be a great learning exercise for every one of us.

From the description:

"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)

The one-argument form is equivalent to d:htmlparse(string,'http://www.w3.org/1999/xhtml',true()).

Parses the string as HTML and/or XML using some inbuilt heuristics to control implied opening and closing of elements.

It doesn't have full knowledge of the HTML DTD but does have the full list of empty elements and the full list of entity definitions. HTML entities, and decimal and hex character references, are all accepted. Note that HTML entities are recognised even if html-mode=false().

Element names are lowercased (if html-mode is true()) and placed into the namespace specified by the namespace parameter (which may be "" to denote no namespace), unless the input has explicit namespace declarations, in which case these will be honoured.

Attribute names are lowercased if html-mode=true()."

Read a more detailed description here.
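One practical consequence of the quoted naming rules is that, once elements have been placed in the XHTML namespace, an XPath expression has to use a namespace prefix to match them. The Python sketch below only mimics those rules to show the effect; it is not `d:htmlparse`, and the `qualify` helper and the sample element names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import unescape

XHTML_NS = "http://www.w3.org/1999/xhtml"


def qualify(name, namespace=XHTML_NS, html_mode=True):
    """Hypothetical helper: apply the quoted naming rules to one element name."""
    local = name.lower() if html_mode else name
    # ElementTree writes namespaced names as "{namespace-uri}local-name".
    return f"{{{namespace}}}{local}" if namespace else local


body = ET.Element(qualify("BODY"))
para = ET.SubElement(body, qualify("P"), {"class": "intro"})
# Named, decimal, and hex character references resolve to the same character.
para.text = unescape("&eacute; &#233; &#xE9;")

with_prefix = body.findall("h:p", {"h": XHTML_NS})  # matches the paragraph
without_prefix = body.findall("p")                   # matches nothing
print(len(with_prefix), len(without_prefix), para.text)
```

Passing an empty string as the namespace, as the description allows, leaves the names unqualified, so plain paths such as `p` match again.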
