Beautiful Soup loop over div element in HTML
Problem description
I am attempting to use Beautiful Soup to extract some values from a web page (I am not very experienced with this): the hourly values from a WeatherBug forecast. In Chrome developer mode I can see that the values are nested within div elements.
In Python I can attempt to mimic a web browser and find these values:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103'
# Browser-like headers so the request resembles one sent by Chrome
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
page = requests.get(url, headers=header)
soup = BeautifulSoup(page.text, 'html.parser')
With the code below, I can find 12 of these hour-card__mobile__cond div elements, which seems about right: when I look at the hourly forecast I can see 12 hours of future data. I'm not sure why I am picking up a class that looks specific to mobile devices, though.
temp_containers = soup.find_all('div', class_ = 'hour-card__mobile__cond')
print(type(temp_containers))
print(len(temp_containers))
Output:
<class 'bs4.element.ResultSet'>
12
I am doing something wrong below when I try to loop through all these div elements to dig a little deeper: I get 12 empty lists back. Would anyone have a tip on where I can improve? Ultimately I am looking to put all 12 future hourly forecast values into a pandas DataFrame.
for div in temp_containers:
    a = div.find_all('div', class_ = 'temp ng-binding')
    print(a)
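One possible explanation for the empty lists can be sketched offline. The markup below is hypothetical, not copied from the real page: it assumes the HTML sent by the server carries the temperature div without the ng-binding class, which a browser's JavaScript would add later. Under that assumption, searching for the two-class string finds nothing, while searching for the single class that is present succeeds.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page source): the inner
# div has class "temp" only, without the "ng-binding" class.
html = """
<div class="hour-card__mobile__cond">
  <div class="temp">51°</div>
</div>
<div class="hour-card__mobile__cond">
  <div class="temp">52°</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("div", class_="hour-card__mobile__cond")

# A multi-class string is compared with the exact class attribute value,
# so "temp ng-binding" does not match class="temp".
strict = [list(c.find_all("div", class_="temp ng-binding")) for c in containers]

# Searching for the one class that is actually present finds the values.
loose = [c.find("div", class_="temp").get_text(strip=True) for c in containers]

print(strict)  # [[], []]
print(loose)   # ['51°', '52°']
```

Printing page.text, or one of the containers, shows which classes the downloaded HTML really contains. That is the quickest way to check whether this assumption holds for the live site.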
EDIT: complete code based on the answer, with a pandas DataFrame
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')
stuff = []
for item in soup.select("div.hour-card__mobile__cond"):
    # Take the second child node, drop the trailing degree sign, convert to int
    item = int(item.contents[1].get_text(strip=True)[:-1])
    print(item)
    stuff.append(item)
# One column holding the hourly temperatures
df = pd.DataFrame(stuff)
df.columns = ['temp']
Answer
The website loads its content dynamically via JavaScript once the page has loaded, so you can use requests-html or selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument('--headless')  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
# find_elements_by_css_selector was removed in Selenium 4; use find_elements with By
data = driver.find_elements(By.CSS_SELECTOR, "div.temp.ng-binding")
for item in data:
    print(item.text)
driver.quit()
Output:
51°
52°
53°
54°
53°
53°
52°
51°
51°
50°
50°
49°
Updated per user request:
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("div.hour-card__mobile__cond"):
    # Take the second child node, drop the trailing degree sign, convert to int
    item = int(item.contents[1].get_text(strip=True)[:-1])
    print(item, type(item))
Output:
51 <class 'int'>
52 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
52 <class 'int'>
51 <class 'int'>
51 <class 'int'>
50 <class 'int'>
50 <class 'int'>
50 <class 'int'>
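The [:-1] slice above assumes every value ends with exactly one degree sign. As a more defensive variant, here is a minimal sketch with a hypothetical helper named parse_temp (not part of the original answer) that pulls the first integer out of the text and returns None when there is none. The sample readings are made up for illustration.

```python
import re

def parse_temp(text):
    """Return the first integer found in a string such as '51°', or None."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None

# Made-up sample readings, including stray whitespace and a missing value
readings = ["51°", " 52° ", "-3°", "N/A"]
temps = [parse_temp(r) for r in readings]
print(temps)  # [51, 52, -3, None]
```

In the loop above, parse_temp(item.contents[1].get_text(strip=True)) could replace the int(...[:-1]) expression, so that one unexpected value would not stop the run with a ValueError.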