Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?


Question





This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close-target candidate) that pop up once or twice a week.

I'm developing an application that needs to parse a website with tables in it. Since deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.

Example input looks like this:

<!-- snip -->
<table id="example">
  <tr>
    <th>Example Cell</th>
    <th>Another one</th>
  </tr>
  <tr>
    <td>foobar</td>
    <td>42</td>
  </tr>
</table>
<!-- snip -->

I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression

//table[@id="example"]/tbody/tr[2]/td[1]

which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut the query down to //table[@id], it works again.

What's going wrong?

Solution

The Problem: DOM Requires <tbody/> Tags

Firebug, Chrome's Developer Tools, XPath functions in JavaScript, and other browser-based tools work on the DOM, not the basic HTML source code.

The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says

The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.

There is an in-depth explanation in another answer on Stack Overflow.

On the other hand, HTML does not necessarily require that tag to be used:

The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.

Most XPath Processors Work on Raw XML

Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. HTML parser libraries like tag-soup and htmltidy likewise only output XHTML, not "DOM-HTML".

This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!

Reproducing the Issue

Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser) -- or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), whereas Firebug will always show them.
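
The same check can be scripted. The following is only a minimal sketch in Python, not part of the original answer: it counts the <tbody/> elements in the markup exactly as written. The name count_tbody and the RAW_SOURCE string (the example table wrapped in html/body tags) are made up for illustration, and it assumes the markup is well-formed enough for the standard library's XML parser; real pages usually need a lenient HTML parser instead.

```python
import xml.etree.ElementTree as ET

# The example table from the question, as a server might send it.
# Assumption: well-formed markup, so the stdlib XML parser can read it.
RAW_SOURCE = """
<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body></html>
"""


def count_tbody(markup: str) -> int:
    """Count the <tbody> elements present in the markup as written."""
    root = ET.fromstring(markup.strip())
    return sum(1 for _ in root.iter("tbody"))


# The parser adds nothing, so the raw source reports no <tbody> elements.
print(count_tbody(RAW_SOURCE))
```

A browser's DOM view of the same table would show one <tbody/> element, which is exactly the mismatch described above.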


Solution 1: Remove /tbody Axis Step

Check if the table you're stuck at really does not contain a <tbody/> element (see the previous section). If it does, you've probably got another kind of problem.

Now remove the /tbody axis step, so your query will look like

//table[@id="example"]/tr[2]/td[1]
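
As a rough illustration (a sketch, not the original author's code), here is how both queries behave against the raw example markup using Python's xml.etree.ElementTree. That module supports only a subset of XPath 1.0 and evaluates paths relative to an element, so the leading // is written as .// here; the RAW_SOURCE string is an assumption that wraps the example table in html/body tags.

```python
import xml.etree.ElementTree as ET

# The example table from the question, without any <tbody> element.
RAW_SOURCE = """
<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_SOURCE.strip())

# The query proposed by the browser tool: no <tbody> in the raw markup,
# so nothing matches.
with_tbody = root.findall(".//table[@id='example']/tbody/tr[2]/td[1]")

# Solution 1: the same query with the /tbody step removed.
without_tbody = root.findall(".//table[@id='example']/tr[2]/td[1]")

print(len(with_tbody))                      # 0
print([cell.text for cell in without_tbody])  # ['foobar']
```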

Solution 2: Skip <tbody/> Tags

This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.

Replace the /tbody axis step by a descendant-or-self step:

//table[@id="example"]//tr[2]/td[1]
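
To make the nested-table risk concrete, here is a small hedged sketch in Python's xml.etree.ElementTree (paths are relative there, hence .// instead of //). The NESTED markup is invented for illustration: it is the example table with a hypothetical inner table placed in the second data cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the example: the second data cell now holds
# an inner table with two rows of its own.
NESTED = """
<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr>
    <td>foobar</td>
    <td>
      <table>
        <tr><td>inner 1</td></tr>
        <tr><td>inner 2</td></tr>
      </table>
    </td>
  </tr>
</table>
</body></html>
"""

root = ET.fromstring(NESTED.strip())

# The descendant step reaches rows of the inner table as well.
cells = root.findall(".//table[@id='example']//tr[2]/td[1]")
texts = [cell.text for cell in cells]

print(texts)  # ['foobar', 'inner 2']
```

The query now returns a cell from the inner table in addition to the intended one, which is why this approach is best kept for tables known not to be nested.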

Solution 3: Allow Both Input With and Without <tbody/> Tags

If you're not sure in advance whether your table contains <tbody/> elements, or you use the query in both "HTML source" and DOM contexts, and you don't want to or cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).

  • XPath 1.0:
    //table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
  • XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
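
Some processors support neither form. Python's xml.etree.ElementTree, for instance, implements only a subset of XPath 1.0 without the | union operator. In that case the same idea can be emulated by running both alternatives separately and concatenating the results. This is a swapped-in workaround rather than a real XPath union (no deduplication, no document-order guarantee), and first_data_cells, PATHS, RAW_SOURCE and DOM_LIKE are names made up for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import List

# The two alternatives from the XPath 1.0 union, tried one after the other.
PATHS = (
    ".//table[@id='example']/tr[2]/td[1]",
    ".//table[@id='example']/tbody/tr[2]/td[1]",
)

# Raw markup without <tbody>, and a DOM-like serialization with it.
RAW_SOURCE = """
<html><body><table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table></body></html>
"""
DOM_LIKE = """
<html><body><table id="example"><tbody>
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</tbody></table></body></html>
"""


def first_data_cells(markup: str) -> List[str]:
    """Return the text of the matching cells, with or without <tbody>."""
    root = ET.fromstring(markup.strip())
    return [cell.text for path in PATHS for cell in root.findall(path)]


print(first_data_cells(RAW_SOURCE))  # ['foobar']
print(first_data_cells(DOM_LIKE))    # ['foobar']
```

Because a table normally has its rows either directly below <table> or inside <tbody/>, only one of the two paths matches for a given input, so the concatenation does not produce duplicates in the common case.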
