Scraping table from website [timeanddate.com]
I want to get historical hourly weather data from https://www.timeanddate.com/
This is the page: https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016 - here I have selected February 2016, and the result appears at the bottom of the page.
I followed the steps in this answer: https://stackoverflow.com/a/47280970/9341589
It works perfectly for the first day of each month, but I want to parse the whole month and, if possible, the whole year.
Below is the code I am using (to parse March 1, 2016):
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/weather/usa/dayton/historic?month=3&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

Data = []
table = soup.find('table', attrs={'id': 'wt-his'})
for tr in table.find('tbody').find_all('tr'):
    row = {}  # renamed from `dict` so the built-in is not shadowed
    row['time'] = tr.find('th').text.strip()
    all_td = tr.find_all('td')
    row['temp'] = all_td[1].text
    row['weather'] = all_td[2].text
    row['wind'] = all_td[3].text
    arrow = all_td[4].text  # wind-direction arrow; read but not stored
    row['humidity'] = all_td[5].text
    row['barometer'] = all_td[6].text
    row['visibility'] = all_td[7].text
    Data.append(row)
This gives the result for March 1 only.
That is because the URL contains just the month and year; to change the day, for instance from Feb 1 to Feb 3, the day drop-down on the page (shown in the attached picture) has to be used.
You can iterate over the table elements (tr, th, and td) for a single page:
import contextlib
import re
import typing

import requests
from bs4 import BeautifulSoup as soup

def _remove(d: list) -> list:
    # Strip non-breaking spaces and drop cells that end up empty.
    return list(filter(None, [re.sub('\xa0', '', b) for b in d]))

@contextlib.contextmanager
def get_weather_data(url: str, by_url=True) -> typing.Generator[dict, None, None]:
    # `url` is fetched when by_url is True; otherwise it is treated as raw HTML.
    d = soup(requests.get(url).text if by_url else url, 'html.parser')
    _table = d.find('table', {'id': 'wt-his'})
    # One entry per <tr>: [list of <th> texts, *<td> texts]
    _data = [[[i.text for i in c.find_all('th')], *[i.text for i in c.find_all('td')]] for c in _table.find_all('tr')]
    # Two header rows, then the hourly rows; the last row is discarded.
    [h1], [h2], *data, _ = _data
    _h2 = _remove(h2)
    yield {tuple(_remove(h1)): [dict(zip(_h2, _remove([a, *i]))) for [[a], *i] in data]}

with get_weather_data('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') as weather:
    print(weather)
Output:
{('Conditions', 'Comfort'): [
    {'Time': '12:58 amMon, Feb 1', 'Temp': '50°F', 'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph', 'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'},
    {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.', 'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.82 "Hg'},
    {'Time': '2:58 am', 'Temp': '43°F', 'Weather': 'Mostly cloudy.', 'Wind': '14 mph', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '29.87 "Hg'},
    {'Time': '3:58 am', 'Temp': '42°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.89 "Hg'},
    {'Time': '4:58 am', 'Temp': '41°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.91 "Hg'},
    {'Time': '5:58 am', 'Temp': '39°F', 'Weather': 'Mostly cloudy.', 'Wind': '8 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.93 "Hg'},
    {'Time': '6:58 am', 'Temp': '38°F', 'Weather': 'Partly cloudy.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.96 "Hg'},
    {'Time': '7:58 am', 'Temp': '38°F', 'Weather': 'Partly sunny.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '80%', 'Visibility': '29.99 "Hg'},
    {'Time': '8:58 am', 'Temp': '38°F', 'Weather': 'Overcast.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '78%', 'Visibility': '30.01 "Hg'},
    {'Time': '9:58 am', 'Temp': '40°F', 'Weather': 'Broken clouds.', 'Wind': '7 mph', 'Humidity': '↑', 'Barometer': 'N/A', 'Visibility': '30.01 "Hg'},
    {'Time': '10:58 am', 'Temp': '41°F', 'Weather': 'Broken clouds.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '72%', 'Visibility': '30.02 "Hg'},
    {'Time': '11:58 am', 'Temp': '41°F', 'Weather': 'Partly sunny.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '70%', 'Visibility': '30.04 "Hg'},
    {'Time': '12:58 pm', 'Temp': '42°F', 'Weather': 'Scattered clouds.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '69%', 'Visibility': '30.04 "Hg'},
    {'Time': '1:58 pm', 'Temp': '43°F', 'Weather': 'Partly sunny.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.03 "Hg'},
    {'Time': '2:58 pm', 'Temp': '44°F', 'Weather': 'Partly sunny.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '62%', 'Visibility': '30.02 "Hg'},
    {'Time': '3:58 pm', 'Temp': '46°F', 'Weather': 'Passing clouds.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '58%', 'Visibility': '30.03 "Hg'},
    {'Time': '4:58 pm', 'Temp': '46°F', 'Weather': 'Sunny.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '57%', 'Visibility': '30.04 "Hg'},
    {'Time': '5:58 pm', 'Temp': '43°F', 'Weather': 'Clear.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.06 "Hg'},
    {'Time': '6:58 pm', 'Temp': '39°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '71%', 'Visibility': '30.09 "Hg'},
    {'Time': '7:58 pm', 'Temp': '35°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '79%', 'Visibility': '30.11 "Hg'},
    {'Time': '8:58 pm', 'Temp': '32°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '30.13 "Hg'},
    {'Time': '9:58 pm', 'Temp': '30°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '91%', 'Visibility': '30.14 "Hg'},
    {'Time': '10:58 pm', 'Temp': '28°F', 'Weather': 'Clear.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '93%', 'Visibility': '30.14 "Hg'},
    {'Time': '11:58 pm', 'Temp': '29°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '90%', 'Visibility': '30.13 "Hg'}
]}
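Looking closely at the output above, the last three keys appear to be shifted by one column: 'Humidity' holds the wind-direction arrow, 'Barometer' holds a percentage, and 'Visibility' holds a pressure in "Hg. The first row of the day also has the date fused onto the time ('12:58 amMon, Feb 1'). The following is a minimal post-processing sketch, not part of the original answer; `clean_row`, `REALIGNED_KEYS` and `TIME_PATTERN` are names made up here, and the relabelling is an assumption based only on the values shown in this output.

```python
import re

# Scraped key -> what the value under it appears to hold
# (an assumption inferred from the output shown above).
REALIGNED_KEYS = {
    'Humidity': 'wind_direction',
    'Barometer': 'humidity',
    'Visibility': 'barometer',
}

# A clock time such as "12:58 am", optionally followed by fused date text.
TIME_PATTERN = re.compile(r'^(\d{1,2}:\d{2}\s*[ap]m)\s*(.*)$')


def clean_row(row: dict) -> dict:
    """Return a copy of one scraped row with relabelled keys and a split time field."""
    cleaned = {REALIGNED_KEYS.get(key, key.lower()): value for key, value in row.items()}
    match = TIME_PATTERN.match(cleaned.get('time', ''))
    if match:
        cleaned['time'] = match.group(1)
        # Only the first row of a day carries a date; the others get None.
        cleaned['date'] = match.group(2) or None
    return cleaned


first = clean_row({'Time': '12:58 amMon, Feb 1', 'Temp': '50°F',
                   'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph',
                   'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'})
print(first['time'], '|', first['date'], '|', first['humidity'], '|', first['barometer'])
# prints: 12:58 am | Mon, Feb 1 | 88% | 29.79 "Hg
```

If the site changes its column layout, the mapping in `REALIGNED_KEYS` would need to be checked again against a fresh page.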
However, in order to scrape the data for all days in the desired month, selenium must be used, as the site dynamically updates the DOM via a request to the backend:
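from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes chromedriver can be found on PATH; older Selenium releases also
# accepted the driver path as the first argument, e.g. '/Path/to/chromedriver'.
d = webdriver.Chrome()
d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016')
_d = {}
# Click each day in the drop-down and parse the table rendered for it.
# find_element(By.ID, ...) replaces find_element_by_id, which later Selenium 4 releases removed.
for i in d.find_element(By.ID, 'wt-his-select').find_elements(By.TAG_NAME, 'option'):
    i.click()
    with get_weather_data(d.page_source, False) as weather:
        _d[i.text] = weather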
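The URL carries the month and year as query parameters, so covering a whole year (as the question asks) could be a matter of repeating the loop above once per monthly URL. Here is a minimal sketch of building those URLs; `BASE_URL` and `month_urls` are names made up for this illustration, and it assumes every month follows the same `?month=..&year=..` pattern as the two URLs in the question.

```python
from urllib.parse import urlencode

# Page used throughout the question; the query string selects month and year.
BASE_URL = 'https://www.timeanddate.com/weather/usa/dayton/historic'


def month_urls(year: int) -> list:
    """Build one historic-weather URL per month of the given year."""
    urls = []
    for month in range(1, 13):
        query = urlencode({'month': month, 'year': year})
        urls.append(f'{BASE_URL}?{query}')
    return urls


for url in month_urls(2016)[:2]:
    print(url)
# prints:
# https://www.timeanddate.com/weather/usa/dayton/historic?month=1&year=2016
# https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016
```

Each URL would then be passed to `d.get(...)` before running the per-day loop, with the results stored under a key that includes the month.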
Edit: to iterate over the full data results, use dict.items:
for a, b in _d.items():
    pass  # do something with a and b
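As one possible "something", the nested result can be flattened into one row per hour and written out as CSV. This is only a sketch using the standard library; `flatten` and `to_csv` are hypothetical helpers, and the sample below is hand-made to mimic the shape of `_d` (the day label is an assumption, since the real option text is not shown in the answer).

```python
import csv
import io


def flatten(results: dict) -> list:
    """Turn {day: {header_tuple: [row, ...]}} into a flat list of row dicts."""
    rows = []
    for day, tables in results.items():
        for table_rows in tables.values():
            for row in table_rows:
                rows.append({'Day': day, **row})
    return rows


def to_csv(rows: list) -> str:
    """Serialise the flat rows as CSV text; columns follow first-seen key order."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-made sample with the same nesting as _d (day label is hypothetical).
sample = {'1 February 2016': {('Conditions', 'Comfort'): [
    {'Time': '1:58 am', 'Temp': '46°F'},
    {'Time': '2:58 am', 'Temp': '43°F'},
]}}
print(to_csv(flatten(sample)))
```

Writing to a file instead of a string only needs `open(path, 'w', newline='')` in place of the `io.StringIO` buffer.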