Parsing a website with BeautifulSoup and Selenium
Question
Trying to compare avg. temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124
I can successfully gather the webpage's source code, but I am having trouble parsing it to extract only the values for the high temps, low temps, rainfall, and the averages under the "History" tab; I can't seem to address the right class/id without getting "None" as the only result.
This is what I have so far, with the last line being an attempt to get the high temps only:
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})
Answer
First of all, these are two different classes - align_right and temperature_red - you've joined them and added that table_data_td for some reason. And, the elements having these two classes are td elements, not table.
In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":
climate_table = soup.find(id="climate_table")
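Once you have that div, the individual cells can be pulled out with ordinary BeautifulSoup traversal. A minimal sketch against a simplified, hypothetical snippet of the page's markup (the real site's class names and table layout may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real climate_table div
html = """
<div id="climate_table">
  <table>
    <tr><td class="high">34</td><td class="low">18</td></tr>
    <tr><td class="high">38</td><td class="low">20</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "lxml")
climate_table = soup.find(id="climate_table")

# class_ matches elements that carry the given class, even alongside other classes
highs = [td.get_text(strip=True) for td in climate_table.find_all("td", class_="high")]
lows = [td.get_text(strip=True) for td in climate_table.find_all("td", class_="low")]
print(highs, lows)
```

The same pattern applies to the real page once you inspect which classes the site actually uses on its td elements.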
Another important thing to note is that there is potential for "timing" issues here - when you get the driver.page_source value, the climate information might not be there yet. This is usually approached by adding an Explicit Wait after navigating to the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()
try:
    browser.get(url)
    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))
    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")
    print(climate_table.prettify())
finally:
    browser.quit()
Note the addition of the try/finally that would safely close the browser in case of an error - that also helps to avoid "hanging" browser windows.
And, look into pandas.read_html() that can read your climate information table into a DataFrame auto-magically.
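A quick sketch of that approach on a hypothetical snippet of table markup (the real page would be read via StringIO(browser.page_source), and you would pick the climate table out of the returned list):

```python
from io import StringIO

import pandas as pd

# Hypothetical table snippet; read_html parses every <table> it finds
# and returns a list of DataFrames, inferring <th> cells as the header
html = StringIO("""
<table>
  <tr><th>Month</th><th>High</th><th>Low</th></tr>
  <tr><td>Jan</td><td>34</td><td>18</td></tr>
  <tr><td>38</td></tr>
</table>
""")
html = StringIO("""
<table>
  <tr><th>Month</th><th>High</th><th>Low</th></tr>
  <tr><td>Jan</td><td>34</td><td>18</td></tr>
  <tr><td>Feb</td><td>38</td><td>20</td></tr>
</table>
""")

df = pd.read_html(html)[0]
print(df)
```

Note that read_html needs an HTML parser such as lxml installed, which this question's setup already has.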