Extracting lxml xpath for html table
Problem description
I have an HTML document similar to the following:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th>
<th style="text-align:right;">High</th>
<th style="text-align:right;">Low</th>
</tr>
<tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
<td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
<td>A Inc.</td>
<td align="right">45.44</td>
<td align="right">44.26</td>
<tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
<td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
<td>B Inc.</td>
<td align="right">18.29</td>
<td align="right">17.92</td>
</div></html>
I need to extract code/name/high/low
information from the table.
I used the following code from a similar example on Stack Overflow:
#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)
for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print
#############################
I get no output. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').
I just don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.
You are probably looking at the HTML in Firebug, correct? The browser inserts an implicit <tbody> tag when it is not present in the document. The lxml library only processes the tags present in the raw HTML string.
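The difference can be checked directly by running both XPath expressions against the same markup. The following is a minimal sketch in Python 3. RAW_HTML and count_rows are hypothetical names introduced here for illustration, and the markup is a cut-down stand-in for the real page source.

```python
import lxml.html

# Hypothetical stand-in for the raw page source: the markup has no <tbody>.
RAW_HTML = """<html><body><table class="quotes">
<tr><th>Code</th><th>Name</th></tr>
<tr><td>A</td><td>A Inc.</td></tr>
<tr><td>B</td><td>B Inc.</td></tr>
</table></body></html>"""


def count_rows(raw_html):
    """Return (rows matched via tbody, rows matched directly under table)."""
    tree = lxml.html.fromstring(raw_html)
    with_tbody = tree.xpath('//table[@class="quotes"]/tbody/tr')
    without_tbody = tree.xpath('//table[@class="quotes"]/tr')
    return len(with_tbody), len(without_tbody)


if __name__ == "__main__":
    print(count_rows(RAW_HTML))
```

With markup like this, the tbody path should match nothing, while the direct path matches every row.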
Omit the tbody level in your XPath. For example, this works:
import lxml.html
# raw_html holds the page source as a string
tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
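To go from the matched rows to the code/name/high/low values, one option is to read the text of each cell in a row. The sketch below is written for Python 3 and assumes markup shaped like the sample in the question. SAMPLE_HTML and extract_quotes are hypothetical names, and the XPath union is one way to tolerate pages that do include an explicit tbody, not the only one.

```python
import lxml.html

# Hypothetical sample modeled on the table in the question.
SAMPLE_HTML = """
<html><body>
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
<td align="right">45.44</td><td align="right">44.26</td></tr>
<tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
<td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</div>
</body></html>
"""

# Match rows directly under the table, or under a tbody if the source has one.
ROW_XPATH = ('//table[@class="quotes"]/tr'
             ' | //table[@class="quotes"]/tbody/tr')


def extract_quotes(raw_html):
    """Return one list of cell strings per table row, header row included."""
    tree = lxml.html.fromstring(raw_html)
    rows = []
    for tr in tree.xpath(ROW_XPATH):
        # text_content() also collects text nested in child tags such as <a>.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in extract_quotes(SAMPLE_HTML):
        print(" ".join(row))
```

Using text_content() avoids the separate td[position()=1]/a/text() branch from the question, since the link text is picked up along with the plain cells.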