How to parse an HTML table against a list of variables using lxml?
Question
I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches the results, I only want to extract a column's contents when the cell starts with one of the variables in my config file. For instance, if a <td> starts with 'Street 1', I want to grab the contents of the <span> inside that <td>. That way I end up with a tuple of tuples (with None for any missing values) that I can then store in the database.
lxml_parse.py
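import lxml.html as lh

# Parse the file; the with-block closes it once parsing is done.
with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Text of every <span class="boldred"> that sits directly inside a <td>.
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)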
test.htm
<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Country:<span class="required"> *</span><br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
</td>
<td>
Zip:<br />
<span class="boldred">10022</span>
</td>
</tr>
Output:
$ python lxml_parse.py
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']
Parsing against a list of variables is where I run into trouble:
import lxml.html as lh

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']

with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print(myresultset)
Recommended answer
lxml_tempsofsol.py:
import lxml.html as lh

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']

with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Interpolate each label into the XPath, then take the first matching span text.
# Note: [0] raises IndexError if a label has no matching cell.
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)
for each in myresultset:
    print(each)
Output:
$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')
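Since the question asks for a tuple of tuples that copes with missing values, here is a possible variation on the script above. It is only a sketch, not a drop-in replacement: the helper name extract_fields and the inlined, trimmed copy of test.htm are assumptions made so the example runs on its own. It passes the label as an lxml XPath variable ($label) instead of formatting it into the string, matches with starts-with() on the cell's first text node as the question describes, and yields None when a label has no matching cell.

```python
import lxml.html as lh

# Trimmed copy of test.htm, inlined so the sketch is self-contained.
# The Zip cell is left out on purpose to show the None case.
HTML = """
<table>
<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
</tr>
</table>
"""

# $label is an XPath variable; lxml fills it in from a keyword argument,
# so no string formatting or quote escaping is needed.
QUERY = ('//tr/td[starts-with(normalize-space(text()[1]), $label)]'
         '/span[@class="boldred"]/text()')


def extract_fields(root, labels):
    """Return a tuple of (label, value) pairs; value is None if not found."""
    results = []
    for label in labels:
        matches = root.xpath(QUERY, label=label)
        results.append((label, matches[0] if matches else None))
    return tuple(results)


root = lh.fromstring(HTML)
fields = extract_fields(root, ['Street 1', 'Street 2', 'City', 'State', 'Zip'])
for pair in fields:
    print(pair)
```

With the real test.htm you would pass the tree returned by lh.parse(doc) instead of root; xpath() is called the same way on it.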