来自网站[timeanddate.com]的抓取表格 [英] Scraping table from website [timeanddate.com]

查看:108
本文介绍了来自网站[timeanddate.com]的抓取表格的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从 https://www.timeanddate.com/中获取历史每小时天气数据.

I want to get the historical hourly weather data from https://www.timeanddate.com/

这是网站链接: https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016 -在这里,我选择的是2月和2016年,结果将显示在页面底部.

This is the website link:https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016 - Here I am selecting February and 2016, and the result will appear in the bottom of the page.

我使用了以下步骤: https://stackoverflow.com/a/47280970/9341589

它在"每个月的第一天"上运行良好,我想分析所有月份,以及全年(如果可能).

and it is working perfectly on the "first day of each month", I want to parse all the month, and if it is possible all the year.

下面是我正在使用的代码(用于解析2016年3月1日):

below the code I am using (to parse March 1, 2016):

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.timeanddate.com/weather/usa/dayton/historic?month=3&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

Data = []
table = soup.find('table', attrs={'id':'wt-his'})
for tr in table.find('tbody').find_all('tr'):
   dict = {}
   dict['time'] = tr.find('th').text.strip()
   all_td = tr.find_all('td')
   dict['temp'] = all_td[1].text
   dict['weather'] = all_td[2].text
   dict['wind'] = all_td[3].text
   arrow = all_td[4].text


   dict['humidity'] = all_td[5].text
   dict['barometer'] = all_td[6].text
   dict['visibility'] = all_td[7].text

   Data.append(dict)

这是3月1日的结果:

this is the result for March 1:

这是因为网站"URL",链接仅包含月份和年份,并且要更改日期,例如,从2月1日到2月3日,该标签显示在需要使用的图片中: a href ="https://i.stack.imgur.com/nQEWp.png" rel ="nofollow noreferrer">

This is because the website "url", the link only include the month and year, and to change the days, for instance, from Feb1 to Feb 3, the tab is shown in the pic attached needed to be used:

推荐答案

您可以为单个页面遍历表元素(trthtd):

You can iterate over the table elements (tr, th, and td) for a single page:

import requests, re, typing
from bs4 import BeautifulSoup as soup
import contextlib
def _remove(d:list) -> list:
   return list(filter(None, [re.sub('\xa0', '', b) for b in d]))

@contextlib.contextmanager
def get_weather_data(url:str, by_url = True) -> typing.Generator[dict, None, None]:
   d = soup(requests.get(url).text if by_url else url, 'html.parser')
   _table = d.find('table', {'id':'wt-his'})
   _data = [[[i.text for i in c.find_all('th')], *[i.text for i in c.find_all('td')]] for c in _table.find_all('tr')]
   [h1], [h2], *data, _ = _data
   _h2 = _remove(h2)
   yield {tuple(_remove(h1)):[dict(zip(_h2, _remove([a, *i]))) for [[a], *i] in data]}


with get_weather_data('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') as weather:
 print(weather)

输出:

{('Conditions', 'Comfort'): [{'Time': '12:58 amMon, Feb 1', 'Temp': '50°F', 'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph', 'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'}, {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.', 'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.82 "Hg'}, {'Time': '2:58 am', 'Temp': '43°F', 'Weather': 'Mostly cloudy.', 'Wind': '14 mph', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '29.87 "Hg'}, {'Time': '3:58 am', 'Temp': '42°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.89 "Hg'}, {'Time': '4:58 am', 'Temp': '41°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.91 "Hg'}, {'Time': '5:58 am', 'Temp': '39°F', 'Weather': 'Mostly cloudy.', 'Wind': '8 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.93 "Hg'}, {'Time': '6:58 am', 'Temp': '38°F', 'Weather': 'Partly cloudy.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.96 "Hg'}, {'Time': '7:58 am', 'Temp': '38°F', 'Weather': 'Partly sunny.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '80%', 'Visibility': '29.99 "Hg'}, {'Time': '8:58 am', 'Temp': '38°F', 'Weather': 'Overcast.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '78%', 'Visibility': '30.01 "Hg'}, {'Time': '9:58 am', 'Temp': '40°F', 'Weather': 'Broken clouds.', 'Wind': '7 mph', 'Humidity': '↑', 'Barometer': 'N/A', 'Visibility': '30.01 "Hg'}, {'Time': '10:58 am', 'Temp': '41°F', 'Weather': 'Broken clouds.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '72%', 'Visibility': '30.02 "Hg'}, {'Time': '11:58 am', 'Temp': '41°F', 'Weather': 'Partly sunny.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '70%', 'Visibility': '30.04 "Hg'}, {'Time': '12:58 pm', 'Temp': '42°F', 'Weather': 'Scattered clouds.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '69%', 'Visibility': '30.04 "Hg'}, {'Time': '1:58 pm', 'Temp': '43°F', 'Weather': 'Partly sunny.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.03 "Hg'}, {'Time': '2:58 pm', 'Temp': '44°F', 'Weather': 'Partly sunny.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '62%', 'Visibility': '30.02 "Hg'}, {'Time': '3:58 pm', 'Temp': '46°F', 'Weather': 'Passing clouds.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '58%', 'Visibility': '30.03 "Hg'}, {'Time': '4:58 pm', 'Temp': '46°F', 'Weather': 'Sunny.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '57%', 'Visibility': '30.04 "Hg'}, {'Time': '5:58 pm', 'Temp': '43°F', 'Weather': 'Clear.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.06 "Hg'}, {'Time': '6:58 pm', 'Temp': '39°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '71%', 'Visibility': '30.09 "Hg'}, {'Time': '7:58 pm', 'Temp': '35°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '79%', 'Visibility': '30.11 "Hg'}, {'Time': '8:58 pm', 'Temp': '32°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '30.13 "Hg'}, {'Time': '9:58 pm', 'Temp': '30°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '91%', 'Visibility': '30.14 "Hg'}, {'Time': '10:58 pm', 'Temp': '28°F', 'Weather': 'Clear.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '93%', 'Visibility': '30.14 "Hg'}, {'Time': '11:58 pm', 'Temp': '29°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '90%', 'Visibility': '30.13 "Hg'}]}

但是,为了在所需月份的所有日期中抓取数据,必须使用selenium,因为该站点通过对后端的请求来动态更新DOM:

However, in order to scrape the data for all days in the desired month, selenium must be used, as the site dynamically updates the DOM via a request to the backend:

from selenium import webdriver
d = webdriver.Chrome('/Path/to/chromedriver')
d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016')
_d = {}
for i in d.find_element_by_id('wt-his-select').find_elements_by_tag_name('option'):
  i.click()
  with get_weather_data(d.page_source, False) as weather:
    _d[i.text] = weather

要遍历完整数据结果,请使用dict.items:

to iterate over the full data results, use dict.items:

for a, b in _d.items():
  pass #do something with a and b

这篇关于来自网站[timeanddate.com]的抓取表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆