Extracting lxml xpath for html table


Problem description

I have an HTML document similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract code/name/high/low information from the table.

I used the following code, taken from a similar example on Stack Overflow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I get no output. The loop only produces rows if I change the first XPath from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').

I just don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.

Solution

You are probably looking at the HTML in Firebug, correct? Firebug shows the browser's parsed DOM, and the browser inserts an implicit <tbody> element when the document does not contain one. The lxml library only processes the tags present in the raw HTML string, so a /tbody/ step in the XPath matches nothing.
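A quick way to check this is to parse a small table and count what each XPath matches. The snippet below is a minimal sketch in Python 3 syntax (the question's code is Python 2); `RAW_HTML` is a made-up, cut-down stand-in for the page in the question, not the real eoddata.com markup.

```python
import lxml.html

# Hypothetical sample modelled on the question's table.
# Note that the source contains no <tbody> element.
RAW_HTML = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td><td>45.44</td><td>44.26</td></tr>
</table>
</div>
"""

tree = lxml.html.fromstring(RAW_HTML)

# The XPath copied from the browser's DOM view, with a tbody step.
rows_via_tbody = tree.xpath('//table[@class="quotes"]/tbody/tr')
# The same XPath without the tbody step.
rows_direct = tree.xpath('//table[@class="quotes"]/tr')
# How many tbody elements exist in the tree lxml built.
tbody_count = len(tree.xpath('//tbody'))

print(len(rows_via_tbody), len(rows_direct), tbody_count)
```

If the parsed tree has no tbody elements, the first count is zero while the second matches every row, which is the behaviour described in the question.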

Omit the tbody level in your XPath. For example, this works:

tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
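Putting this together for the code/name/high/low extraction, one possible approach is sketched below in Python 3 syntax. The helper name `extract_quotes` and the `SAMPLE` markup are assumptions made for illustration. The sketch uses `//tr` under the table instead of `/tr`, so it matches rows whether or not a tbody wraps them; the trade-off is that it would also pick up rows of any table nested inside the quotes table.

```python
import lxml.html


def extract_quotes(raw_html):
    """Return one list of cell texts per row of the table with class 'quotes'."""
    tree = lxml.html.fromstring(raw_html)
    rows = []
    # The descendant step (//tr) matches rows with or without a <tbody> wrapper.
    for tr in tree.xpath('//table[@class="quotes"]//tr'):
        # text_content() also collects text nested inside <a> elements.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample modelled on the question's table.
SAMPLE = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td><td>45.44</td><td>44.26</td></tr>
<tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td><td>18.29</td><td>17.92</td></tr>
</table>
</div>
"""

quotes = extract_quotes(SAMPLE)
for row in quotes:
    print(" ".join(row))
```

To run it against a live page, the raw HTML could be fetched first (for example with `urllib.request.urlopen(url).read()` in Python 3) and passed to `extract_quotes`.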
