Python: Get HTML table data by XPath
Question
I feel that extracting data from HTML tables is extremely difficult and requires a custom build for each site. I would very much like to be proved wrong here.
Is there a simple, Pythonic way to extract strings and numbers from a website using just the URL and the XPath of the table of interest?
Example:
url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
xpath_str = '//*[@id="sortabletable"]'
I once had a script that could fetch data from this site, but I lost it. As I recall, it used the tag '' and some string logic. Not very pretty.
I know that sites like thingspeak can do these things.
Recommended answer
There is a fairly general pattern that you can use to parse many tables, though not all.
import lxml.html as LH
import requests
import pandas as pd

def text(elt):
    # Return the element's text, with non-breaking spaces replaced by plain spaces.
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    # The leading '.' keeps each query inside this table; a bare '//' would
    # search the whole document, even when called on the table element.
    header = [text(th) for th in table.xpath('.//th')]        # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('.//tr')]                   # 2
    data = [row for row in data if len(row) == len(header)]   # 3
    data = pd.DataFrame(data, columns=header)                 # 4
    print(data)
- Query the table for th elements to find the column names (# 1).
- Query the table for tr elements to get the rows; for each row, tr.xpath('td') returns the elements representing the individual cells of the table (# 2).
- Sometimes you may need to filter out certain rows, such as, in this case, rows with fewer values than the header (# 3).
- What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only (# 4):
     Pris  Adresse                               Tidspunkt
0    8.04  Brovejen 18 5500 Middelfart           3 min 38 sek
1    7.88  Hovedvejen 11 5500 Middelfart         4 min 52 sek
2    7.88  Assensvej 105 5500 Middelfart         5 min 56 sek
3    8.23  Ejby Industrivej 111 2600 Glostrup    6 min 28 sek
4    8.15  Park Alle 125 2605 Brøndby            25 min 21 sek
5    8.09  Sletvej 36 8310 Tranbjerg J           25 min 34 sek
6    8.24  Vindinggård Center 29 7100 Vejle      27 min 6 sek
7  7.99 *  Søndergade 116 8620 Kjellerup         31 min 27 sek
8  7.99 *  Gertrud Rasks Vej 1 9210 Aalborg SØ   31 min 27 sek
9  7.99 *  Sorøvej 13 4200 Slagelse              31 min 27 sek
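The question asks for something driven only by a page and an XPath, and for numbers as well as strings. The same pattern could be wrapped roughly as follows. This is a minimal sketch, not a definitive implementation: table_rows and to_number are hypothetical helper names (they are not part of lxml), SAMPLE_HTML is made-up markup standing in for a downloaded page, and the sketch assumes the XPath matches a single table whose header cells are th elements.

```python
import re
import lxml.html as LH


def text(elt):
    """Cell text with non-breaking spaces normalised and outer whitespace trimmed."""
    return elt.text_content().replace('\xa0', ' ').strip()


def table_rows(html, xpath_str):
    """Return (header, rows) for the first element matching xpath_str.

    Hypothetical helper for illustration; not part of lxml.
    """
    root = LH.fromstring(html)
    matches = root.xpath(xpath_str)
    if not matches:
        raise ValueError('no element matches %r' % xpath_str)
    table = matches[0]
    # The leading '.' keeps each query inside the matched table.
    header = [text(th) for th in table.xpath('.//th')]
    rows = [[text(td) for td in tr.xpath('td')] for tr in table.xpath('.//tr')]
    # Drop rows whose cell count differs from the header (header row, spacer rows).
    rows = [row for row in rows if len(row) == len(header)]
    return header, rows


def to_number(cell):
    """Return the first decimal number found in cell as a float, or None."""
    match = re.search(r'-?\d+(?:\.\d+)?', cell)
    return float(match.group()) if match else None


# Made-up sample markup (an assumption, not the real page's HTML).
SAMPLE_HTML = """
<html><body>
<table id="sortabletable">
  <tr><th>Pris</th><th>Adresse</th></tr>
  <tr><td>8.04</td><td>Brovejen 18</td></tr>
  <tr><td>7.99&nbsp;*</td><td>Sorøvej 13</td></tr>
  <tr><td colspan="2">advert</td></tr>
</table>
</body></html>
"""

header, rows = table_rows(SAMPLE_HTML, '//*[@id="sortabletable"]')
prices = [to_number(row[0]) for row in rows]
print(header)
print(rows)
print(prices)
```

For a live page, the html argument could be the body returned by requests.get(url).content, as in the answer above. Tables that use rowspan or colspan, or that put their headers in td cells, would still need site-specific handling.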