Scraping table from website [timeanddate.com]


Problem description

I want to get the historical hourly weather data from https://www.timeanddate.com/

This is the website link: https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016 (here I am selecting February and 2016; the result appears at the bottom of the page).

I followed the steps in this answer: https://stackoverflow.com/a/47280970/9341589

It works perfectly for the first day of each month, but I want to parse the whole month and, if possible, the whole year.

Below is the code I am using (to parse March 1, 2016):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/weather/usa/dayton/historic?month=3&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

Data = []
table = soup.find('table', attrs={'id': 'wt-his'})
for tr in table.find('tbody').find_all('tr'):
    row = {}  # renamed from `dict` so the built-in type is not shadowed
    row['time'] = tr.find('th').text.strip()
    all_td = tr.find_all('td')
    # all_td[0] is not used
    row['temp'] = all_td[1].text
    row['weather'] = all_td[2].text
    row['wind'] = all_td[3].text
    arrow = all_td[4].text  # arrow cell after the wind speed; read but not stored
    row['humidity'] = all_td[5].text
    row['barometer'] = all_td[6].text
    row['visibility'] = all_td[7].text

    Data.append(row)

This is the result for March 1:

This is because the URL only includes the month and the year. To change the day, for instance from Feb 1 to Feb 3, the selector shown in the attached picture has to be used:

Solution

You can iterate over the table elements (tr, th, and td) for a single page:

import contextlib
import re
import typing

import requests
from bs4 import BeautifulSoup as soup

def _remove(d: list) -> list:
    # Strip non-breaking spaces, then drop cells that end up empty
    return list(filter(None, [re.sub('\xa0', '', b) for b in d]))

@contextlib.contextmanager
def get_weather_data(url: str, by_url=True) -> typing.Generator[dict, None, None]:
    # `url` is fetched when by_url is True; otherwise it is treated as raw HTML
    d = soup(requests.get(url).text if by_url else url, 'html.parser')
    _table = d.find('table', {'id': 'wt-his'})
    # One entry per table row: [[th texts], td text, td text, ...]
    _data = [[[i.text for i in c.find_all('th')], *[i.text for i in c.find_all('td')]] for c in _table.find_all('tr')]
    # Two header rows, then the data rows; the last row is discarded
    [h1], [h2], *data, _ = _data
    _h2 = _remove(h2)
    yield {tuple(_remove(h1)): [dict(zip(_h2, _remove([a, *i]))) for [[a], *i] in data]}


with get_weather_data('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') as weather:
    print(weather)

Output:

{('Conditions', 'Comfort'): [
 {'Time': '12:58 amMon, Feb 1', 'Temp': '50°F', 'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph', 'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'},
 {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.', 'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.82 "Hg'},
 {'Time': '2:58 am', 'Temp': '43°F', 'Weather': 'Mostly cloudy.', 'Wind': '14 mph', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '29.87 "Hg'},
 {'Time': '3:58 am', 'Temp': '42°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.89 "Hg'},
 {'Time': '4:58 am', 'Temp': '41°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.91 "Hg'},
 {'Time': '5:58 am', 'Temp': '39°F', 'Weather': 'Mostly cloudy.', 'Wind': '8 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.93 "Hg'},
 {'Time': '6:58 am', 'Temp': '38°F', 'Weather': 'Partly cloudy.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.96 "Hg'},
 {'Time': '7:58 am', 'Temp': '38°F', 'Weather': 'Partly sunny.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '80%', 'Visibility': '29.99 "Hg'},
 {'Time': '8:58 am', 'Temp': '38°F', 'Weather': 'Overcast.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '78%', 'Visibility': '30.01 "Hg'},
 {'Time': '9:58 am', 'Temp': '40°F', 'Weather': 'Broken clouds.', 'Wind': '7 mph', 'Humidity': '↑', 'Barometer': 'N/A', 'Visibility': '30.01 "Hg'},
 {'Time': '10:58 am', 'Temp': '41°F', 'Weather': 'Broken clouds.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '72%', 'Visibility': '30.02 "Hg'},
 {'Time': '11:58 am', 'Temp': '41°F', 'Weather': 'Partly sunny.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '70%', 'Visibility': '30.04 "Hg'},
 {'Time': '12:58 pm', 'Temp': '42°F', 'Weather': 'Scattered clouds.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '69%', 'Visibility': '30.04 "Hg'},
 {'Time': '1:58 pm', 'Temp': '43°F', 'Weather': 'Partly sunny.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.03 "Hg'},
 {'Time': '2:58 pm', 'Temp': '44°F', 'Weather': 'Partly sunny.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '62%', 'Visibility': '30.02 "Hg'},
 {'Time': '3:58 pm', 'Temp': '46°F', 'Weather': 'Passing clouds.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '58%', 'Visibility': '30.03 "Hg'},
 {'Time': '4:58 pm', 'Temp': '46°F', 'Weather': 'Sunny.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '57%', 'Visibility': '30.04 "Hg'},
 {'Time': '5:58 pm', 'Temp': '43°F', 'Weather': 'Clear.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.06 "Hg'},
 {'Time': '6:58 pm', 'Temp': '39°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '71%', 'Visibility': '30.09 "Hg'},
 {'Time': '7:58 pm', 'Temp': '35°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '79%', 'Visibility': '30.11 "Hg'},
 {'Time': '8:58 pm', 'Temp': '32°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '30.13 "Hg'},
 {'Time': '9:58 pm', 'Temp': '30°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '91%', 'Visibility': '30.14 "Hg'},
 {'Time': '10:58 pm', 'Temp': '28°F', 'Weather': 'Clear.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '93%', 'Visibility': '30.14 "Hg'},
 {'Time': '11:58 pm', 'Temp': '29°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '90%', 'Visibility': '30.13 "Hg'}
]}
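
Note that the last three keys in this output look shifted by one column: the value under 'Humidity' is an arrow, 'Barometer' holds a percentage, and 'Visibility' holds a pressure in "Hg. This is presumably because the arrow column has no header text, so zip pairs the headers with one cell too many and drops the last one. A minimal sketch of relabeling a row afterwards, assuming that reading is correct; `relabel_row` and the 'Wind direction' key are hypothetical names, and the real visibility value cannot be recovered from this output:

```python
def relabel_row(row: dict) -> dict:
    """Return a copy of one scraped row with the shifted columns renamed.

    Assumption: 'Humidity' actually holds the arrow cell, 'Barometer'
    the relative humidity, and 'Visibility' the pressure reading.
    """
    return {
        'Time': row['Time'],
        'Temp': row['Temp'],
        'Weather': row['Weather'],
        'Wind': row['Wind'],
        'Wind direction': row['Humidity'],  # hypothetical key name
        'Humidity': row['Barometer'],
        'Barometer': row['Visibility'],
    }


# One row copied from the output above
sample = {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.',
          'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%',
          'Visibility': '29.82 "Hg'}
fixed = relabel_row(sample)
print(fixed['Humidity'], fixed['Barometer'])  # 83% 29.82 "Hg
```

Relabeling after the fact leaves get_weather_data untouched; the alternative would be to give the arrow column its own header before zipping.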

However, to scrape the data for every day of the desired month, Selenium is needed, because the site updates the DOM dynamically via a request to the backend whenever another day is selected:

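from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the driver path goes through Service and elements are located with By
d = webdriver.Chrome(service=Service('/Path/to/chromedriver'))
d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016')
_d = {}
# Click every <option> of the day selector and parse the page source after each click
for i in d.find_element(By.ID, 'wt-his-select').find_elements(By.TAG_NAME, 'option'):
    i.click()
    with get_weather_data(d.page_source, False) as weather:
        _d[i.text] = weather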
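
The question also asks about a whole year. Since the URL carries the month and year as query parameters, one possible extension is to generate the twelve monthly URLs first and run the Selenium loop above on each of them. A minimal sketch; `monthly_urls` is a hypothetical helper, and the URL pattern is assumed from the single example link in the question:

```python
from urllib.parse import urlencode

BASE = 'https://www.timeanddate.com/weather/usa/dayton/historic'


def monthly_urls(year: int) -> list:
    """Build one historic-weather URL per month of the given year."""
    return [f'{BASE}?{urlencode({"month": m, "year": year})}' for m in range(1, 13)]


urls = monthly_urls(2016)
print(urls[1])  # https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016
```

Each URL would then be passed to d.get(...) before iterating over the day selector.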

Edit: to iterate over the full data results, use dict.items:

for a, b in _d.items():
    pass  # do something with a and b
