Parsing a website with BeautifulSoup and Selenium


Problem description

Trying to compare avg. temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124

I can successfully gather the webpage's source code, but I am having trouble parsing it down to just the values for the high temps, low temps, rainfall, and the averages under the "History" tab; I can't seem to target the right class/id, and every attempt returns "None".

This is what I have so far, with the last line being an attempt to get the high temps only:

from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})

Answer

First of all, these are two different classes - align_right and temperature_red - you've joined them and added that table_data_td for some reason. And, the elements having these two classes are td elements, not table.
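To see the difference, here is a minimal sketch with made-up markup and cell values (the real page's HTML may differ): a CSS selector like `td.align_right.temperature_red` matches only `td` elements that carry both classes, in any order, whereas the concatenated string from the question matches nothing.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page's cells may differ.
html = ('<table><tr>'
        '<td class="align_right temperature_red">64</td>'
        '<td class="align_right">18</td>'
        '</tr></table>')
soup = BeautifulSoup(html, "html.parser")

# The compound selector requires BOTH classes on the same td.
cells = soup.select("td.align_right.temperature_red")
values = [td.get_text() for td in cells]  # only the first cell matches
```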

In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":

climate_table = soup.find(id="climate_table")
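Once you have that element, you can walk its rows to pull out label/value pairs. The markup below is a hypothetical stand-in for `div#climate_table` (the actual layout has to be inspected in the browser), but the row-walking pattern carries over:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual
# div#climate_table layout has to be inspected in the browser.
html = """
<div id="climate_table">
  <table>
    <tr><td>Average high</td><td>34</td></tr>
    <tr><td>Average low</td><td>18</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
climate_table = soup.find(id="climate_table")

# Collect label/value pairs row by row.
data = {}
for tr in climate_table.find_all("tr"):
    label, value = [td.get_text(strip=True) for td in tr.find_all("td")]
    data[label] = value
```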

Another important thing to note is that there is potential for a "timing" issue here - when you read the driver.page_source value, the climate information might not be there yet. This is usually handled by adding an Explicit Wait after navigating to the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup


url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()

try:
    browser.get(url)

    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))

    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")

    print(climate_table.prettify())
finally:
    browser.quit()

Note the addition of the try/finally that would safely close the browser in case of an error - that would also help to avoid "hanging" browser windows.

And, look into pandas.read_html(), which can read your climate information table into a DataFrame auto-magically.
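As a sketch of that approach (with a stand-in table rather than the real page; on the live site you would pass browser.page_source wrapped in io.StringIO and pick the climate table out of the returned list):

```python
import io
import pandas as pd

# A stand-in table; on the real page you would pass browser.page_source
# wrapped in io.StringIO and select the climate table from the list.
html = ("<table><tr><th>Month</th><th>High</th></tr>"
        "<tr><td>Jan</td><td>34</td></tr></table>")
tables = pd.read_html(io.StringIO(html))  # returns a list of DataFrames
df = tables[0]
```

Note that read_html needs an HTML parser such as lxml installed, and recent pandas versions expect literal HTML to be wrapped in a file-like object, hence io.StringIO.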
