Python: Get HTML table data by XPath

Question

I find extracting data from HTML tables extremely difficult, and it seems to require a custom build for each site. I would very much like to be proved wrong here.

Is there a simple, Pythonic way to extract strings and numbers from a website using just the URL and the XPath of the table of interest?

Example:

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
xpath_str = '//*[@id="sortabletable"]'

I once had a script that could fetch data from this site, but I lost it. As I recall, it used the tag '' and some string logic. Not very pretty.

I know that sites like thingspeak can do these things.

Recommended answer

There is a fairly general pattern that you can use to parse many, though not all, tables.

import lxml.html as LH
import requests
import pandas as pd


def text(elt):
    # Flatten the element to plain text; replace non-breaking spaces with ordinary ones.
    return elt.text_content().replace(u'\xa0', u' ')


url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('.//th')]       # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('.//tr')]                  # 2
    data = [row for row in data if len(row) == len(header)]  # 3
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)

  1. You can use table.xpath('.//th') to find the column names. The leading dot keeps the search inside this table; a bare '//th' would match every th element in the document.
  2. table.xpath('.//tr') returns the rows, and for each row, tr.xpath('td') returns the elements representing the cells of that row.
  3. Sometimes you may need to filter out certain rows, such as, in this case, rows with fewer values than the header.
  4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:


        Pris                               Adresse       Tidspunkt
0       8.04           Brovejen 18 5500 Middelfart   3 min 38 sek 
1       7.88         Hovedvejen 11 5500 Middelfart   4 min 52 sek 
2       7.88         Assensvej 105 5500 Middelfart   5 min 56 sek 
3       8.23    Ejby Industrivej 111 2600 Glostrup   6 min 28 sek 
4       8.15            Park Alle 125 2605 Brøndby  25 min 21 sek 
5       8.09           Sletvej 36 8310 Tranbjerg J  25 min 34 sek 
6       8.24      Vindinggård Center 29 7100 Vejle   27 min 6 sek 
7     7.99 *         Søndergade 116 8620 Kjellerup  31 min 27 sek 
8     7.99 *   Gertrud Rasks Vej 1 9210 Aalborg SØ  31 min 27 sek 
9     7.99 *              Sorøvej 13 4200 Slagelse  31 min 27 sek 
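
Since the question asks for both strings and numbers from just an XPath, here is a minimal sketch of the same pattern wrapped in a reusable function. It is not a definitive implementation. The names parse_tables, cell_text, to_number and SAMPLE_HTML are made up for illustration, and the inline HTML is an invented stand-in shaped like the table above so the example runs without network access. Treating a trailing " *" on a price as a marker that can be dropped is also an assumption. The sketch uses './/th' and './/tr' so that only the matched table is searched.

```python
import re

import lxml.html as LH


def cell_text(elt):
    """Return the cell's text with non-breaking spaces replaced and edges trimmed."""
    return elt.text_content().replace('\xa0', ' ').strip()


def to_number(value):
    """Return a float if the whole cell is a decimal number, optionally
    followed by ' *' (assumed to be a marker); otherwise return the string."""
    match = re.fullmatch(r'(-?\d+(?:\.\d+)?)\s*\*?', value)
    return float(match.group(1)) if match else value


def parse_tables(html, xpath_str):
    """Yield (header, rows) for every element matched by xpath_str."""
    root = LH.fromstring(html)
    for table in root.xpath(xpath_str):
        # Relative paths ('.//') restrict the search to this table.
        header = [cell_text(th) for th in table.xpath('.//th')]
        rows = [[to_number(cell_text(td)) for td in tr.xpath('td')]
                for tr in table.xpath('.//tr')]
        # Keep only rows that have one value per header column.
        yield header, [row for row in rows if len(row) == len(header)]


# Hypothetical sample page: a decoy table plus the table of interest.
SAMPLE_HTML = """
<html><body>
<table id="other"><tr><th>Ignored</th></tr><tr><td>x</td></tr></table>
<table id="sortabletable">
  <tr><th>Pris</th><th>Adresse</th><th>Tidspunkt</th></tr>
  <tr><td>8.04</td><td>Brovejen 18 5500 Middelfart</td><td>3 min 38 sek</td></tr>
  <tr><td>7.99 *</td><td>Sorøvej 13 4200 Slagelse</td><td>31 min 27 sek</td></tr>
  <tr><td colspan="3">footer note</td></tr>
</table>
</body></html>
"""

tables = list(parse_tables(SAMPLE_HTML, '//*[@id="sortabletable"]'))
header, rows = tables[0]
print(header)
for row in rows:
    print(row)
```

For a live page, the same function could be called with the body of a requests response, for example parse_tables(requests.get(url_str).content, xpath_str). Whether the result is clean still depends on how the site builds its table.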
