Extracting an lxml XPath for an HTML table


Question

I have an HTML document similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract the code/name/high/low information from the table.

I used the following code, taken from a similar example on Stack Overflow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I get no output. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').

I just don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.

Recommended answer

You are probably looking at the HTML in Firebug, correct? The browser inserts the implicit <tbody> tag into its DOM when the tag is not present in the document. The lxml library only processes the tags that are actually present in the raw HTML string.

Omit the tbody level in your XPath. For example, this works:

import lxml.html

# raw_html holds the page source as a string
tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
# [<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
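
To see the difference between the two XPath expressions side by side, here is a minimal Python 3 sketch. It assumes a trimmed, hand-written copy of the table from the question (the RAW_HTML string and the extract_rows helper are hypothetical names introduced for illustration, not part of the original post). It uses '//tr' below the table step so that the rows match whether or not a <tbody> element exists in the source.

```python
import lxml.html

# Hypothetical sample: a trimmed copy of the table from the question,
# written without a <tbody> element, as in the raw page source.
RAW_HTML = """
<html><body>
<div id="Symbols" class="cb">
<table class="quotes">
  <tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
  <tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
      <td align="right">45.44</td><td align="right">44.26</td></tr>
  <tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
      <td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</div>
</body></html>
"""


def extract_rows(raw_html):
    """Return one list of cell strings per <tr> of the quotes table."""
    tree = lxml.html.fromstring(raw_html)
    rows = []
    # '//tr' after the table step matches rows with or without a <tbody>.
    for tr in tree.xpath('//table[@class="quotes"]//tr'):
        # text_content() flattens nested markup such as the <a> in the first cell.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


tree = lxml.html.fromstring(RAW_HTML)
# The raw source has no <tbody>, so this expression selects nothing.
with_tbody = tree.xpath('//table[@class="quotes"]/tbody/tr')
# Dropping the tbody step selects the header row and both data rows.
without_tbody = tree.xpath('//table[@class="quotes"]/tr')

rows = extract_rows(RAW_HTML)
for row in rows:
    print(" ".join(row))
```

Using text_content() on each cell avoids the separate './td[position()=1]/a/text()' branch from the question, since it collects the text of the cell and of any elements nested inside it.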
