pandas read_html ValueError: No tables found


Problem Description

I am trying to scrape the historical weather data from the "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" Weather Underground page. I have the following code:

import pandas as pd 

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)

I get the following response:

Traceback (most recent call last):
 File "weather_station_scrapping.py", line 11, in <module>
  result = pd.read_html(page_link)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
  displayed_only=displayed_only)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse raise_with_traceback(retained)
 File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
  raise exc.with_traceback(traceback)
ValueError: No tables found

Although this page clearly has a table, it is not being picked up by read_html. I have tried using Selenium so that the page can load before I read it:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()

Now, the problem is that it cannot find "tr". I would appreciate any suggestions.

Recommended Answer

You can use requests and avoid opening a browser.

You can get the current conditions by using:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

and strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then parse the JSON string (a worked example is shown below).

You can get the summary and history by calling the API with the following:


https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the right. You then have a JSON string you can parse; see the sketch after this paragraph.
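Below is a minimal sketch of that history call. The API key and the jQuery callback token in the URL are session-specific values captured from the network tab, so treat them as placeholders that will likely need refreshing:

import json

import requests

# History/summary endpoint captured from the browser's network tab; the API
# key and callback token below are session-specific placeholders.
url = ('https://api-ak.wunderground.com/api/606f3f6977348613/'
       'history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json'
       '?callback=jQuery1720724027235122559_1542743885015&_=1542743886276')
res = requests.get(url)

# Remove the JSONP wrapper: everything up to and including the first '('
# and the trailing ');'.
text = res.text
data = json.loads(text[text.index('(') + 1:text.rindex(')')])
print(list(data))  # top-level keys of the parsed response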


You can find these URLs by using the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.

An example for the current conditions; note there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")

# Peel the jQuery callback wrapper off the JSONP response.
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
# Swap the nulls for a placeholder string before parsing.
s = s.replace('null', '"placeholder"')

data = json.loads(s)
data = json_normalize(data)
df = pd.DataFrame(data)
print(df)
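One caveat on the example above: str.strip treats its argument as a set of characters rather than a literal prefix or suffix, so the chained strip calls only work because '{' and '}' happen not to be in that set. Also, json.loads already maps JSON null to Python None, so the "placeholder" substitution is only needed if downstream code cannot handle None. A slightly more defensive sketch of the unwrapping step, using a toy payload shaped like the stationlookup response:

import json

# A JSONP payload shaped like the stationlookup response above
# (the key name is purely illustrative).
raw = 'jQuery1720724027235122559_1542743885014({"stations": null})'

# Cut between the first '(' and the last ')' so the parse keeps working
# even if the callback token changes between sessions.
body = raw[raw.index('(') + 1:raw.rindex(')')]
data = json.loads(body)  # JSON null becomes Python None
print(data)  # {'stations': None}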

