Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close-target candidate) that pop up once or twice a week.
I'm developing an application that needs to parse a website with tables in it. As deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.
Example input looks like this:
<!-- snip -->
<table id="example">
<tr>
<th>Example Cell</th>
<th>Another one</th>
</tr>
<tr>
<td>foobar</td>
<td>42</td>
</tr>
</table>
<!-- snip -->
I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression
//table[@id="example"]/tbody/tr[2]/td[1]
which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut the query down to //table[@id], it works again.
What's going wrong?
The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tools, the XPath functions in JavaScript and similar tools work on the DOM, not on the underlying HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are wrapped in table body tags (<tbody/>). Browsers therefore add this tag while parsing (X)HTML if it is missing. For example, Microsoft's DOM documentation says
The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.
There is an in-depth explanation in another answer on Stack Overflow.
On the other hand, HTML does not necessarily require that tag to be used:
The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.
Most XPath Processors Work on Raw XML
Apart from JavaScript, most XPath processors work on raw XML rather than the browser's DOM, and therefore do not add <tbody/> tags. HTML parser libraries like tag-soup and htmltidy likewise only output XHTML, not "DOM-HTML".
This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM, so it is not affected.
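To make the effect concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath. It assumes the example table is well-formed XML; the wrapping html/body elements and the RAW_SOURCE name are added purely for illustration.

```python
import xml.etree.ElementTree as ET

# The question's example table as it appears in the raw source (no tbody).
# The html/body wrapper is an assumption so the table is a descendant of the root.
RAW_SOURCE = """
<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_SOURCE)

# Query as proposed by the browser tool: the parser added no tbody, so nothing matches.
with_tbody = root.find(".//table[@id='example']/tbody/tr[2]/td[1]")

# Same query without the tbody step: matches the first data cell.
without_tbody = root.find(".//table[@id='example']/tr[2]/td[1]")

print(with_tbody)          # None
print(without_tbody.text)  # foobar
```

The parser keeps the tree exactly as written in the source, which is why the browser-derived path finds nothing here.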
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser), or by running curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), whereas Firebug will always show them.
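The same check can be scripted. The following is a rough sketch using Python's standard-library html.parser, which reports the tags as they appear in the source and does not insert missing ones; the TagCollector class and has_explicit_tbody function are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag exactly as it occurs in the raw source."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def has_explicit_tbody(html_source: str) -> bool:
    """Return True if the raw HTML itself contains a tbody tag."""
    collector = TagCollector()
    collector.feed(html_source)
    collector.close()
    return "tbody" in collector.tags


raw = '<table id="example"><tr><td>foobar</td><td>42</td></tr></table>'
print(has_explicit_tbody(raw))  # False
```

Feed it the output of curl (or the saved page source) rather than anything copied from the browser's element inspector, since the latter already reflects the DOM.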
Solution 1: Remove the /tbody Axis Step
Check whether the table you're stuck at really does not contain a <tbody/> element (see the previous section). If it does contain one, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
Replace the /tbody axis step with a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
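The nested-table pitfall can be seen in a small sketch, again with Python's xml.etree.ElementTree. The nested markup below is made up for illustration and is not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the first row of the outer table contains an inner table.
NESTED = """
<body>
<table id="example">
  <tr>
    <td>
      <table>
        <tr><td>a</td></tr>
        <tr><td>inner</td></tr>
      </table>
    </td>
  </tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body>
"""

root = ET.fromstring(NESTED)

# //tr[2] matches every descendant tr that is the second tr of its parent,
# so the second row of the inner table is selected as well.
cells = root.findall(".//table[@id='example']//tr[2]/td[1]")
texts = [cell.text for cell in cells]

print(texts)  # ['inner', 'foobar']
```

The query returns two cells instead of one, and the unwanted inner cell comes first in document order.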
Solution 3: Allow Input Both With and Without <tbody/> Tags
If you're not sure in advance whether your table contains <tbody/> tags, or you use the query in both "HTML source" and DOM contexts, and you don't want to (or cannot) use the hack from solution 2, provide an alternative query (XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
- XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
- XPath 2.0:
//table[@id="example"]/(tbody, .)/tr[2]/td[1]
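If the XPath engine at hand offers only a limited subset of XPath, the same idea can be approximated in application code by trying both alternatives in turn. This sketch uses Python's xml.etree.ElementTree, whose documented XPath subset does not list the union operator; the find_cell helper and the PATHS tuple are hypothetical names.

```python
import xml.etree.ElementTree as ET

# The two alternatives from the XPath 1.0 query, tried one after the other.
PATHS = (
    ".//table[@id='example']/tr[2]/td[1]",
    ".//table[@id='example']/tbody/tr[2]/td[1]",
)


def find_cell(root):
    """Return the first data cell whether or not the table has a tbody."""
    for path in PATHS:
        cell = root.find(path)
        if cell is not None:  # explicit None check: a childless Element is falsy
            return cell
    return None


plain = ET.fromstring(
    "<body><table id='example'>"
    "<tr><th>Example Cell</th></tr><tr><td>foobar</td></tr>"
    "</table></body>"
)
dom_like = ET.fromstring(
    "<body><table id='example'><tbody>"
    "<tr><th>Example Cell</th></tr><tr><td>foobar</td></tr>"
    "</tbody></table></body>"
)

print(find_cell(plain).text)     # foobar
print(find_cell(dom_like).text)  # foobar
```

With a full XPath 1.0 or 2.0 processor, the single-expression queries above are preferable, because they keep the selection logic in one place.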