Beautiful Soup loop over div element in HTML


Problem description


I am attempting to use Beautiful Soup to extract some values from a web page (not much wisdom here...): hourly values from a WeatherBug forecast. In Chrome developer mode I can see the values are nested within a particular set of div classes.

In Python I can attempt to mimic a web browser and find these values:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

page = requests.get(url, headers=header)

soup = BeautifulSoup(page.text, 'html.parser')

With the code below, I can find 12 of these hour-card__mobile__cond div classes, which seems about right, since the hourly forecast page shows 12 hours of future data. I'm not sure why I am picking up a mobile device view, though.

temp_containers = soup.find_all('div', class_ = 'hour-card__mobile__cond')
print(type(temp_containers))
print(len(temp_containers))

Output:

<class 'bs4.element.ResultSet'>
12

I am doing something incorrect below when I attempt to loop through all these div classes to dive down a little further: I get 12 empty lists returned. Would anyone have a tip on where I can improve? Ultimately I am looking to put all 12 future hourly forecast values into a pandas dataframe.

for div in temp_containers:
    a = div.find_all('div', class_ = 'temp ng-binding')
    print(a)
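
A quick diagnostic sketch is to print what each container actually holds in the fetched HTML, to see which child tags and classes are really present:

for div in temp_containers:
    # dump each card's static markup; values injected by JavaScript will not appear here
    print(div.prettify())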

EDIT: complete code based on the answer, building a pandas dataframe

import requests
from bs4 import BeautifulSoup
import pandas as pd


r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')

stuff = []

for item in soup.select("div.hour-card__mobile__cond"):
    # the card's second child holds the temperature text, e.g. "51°"; drop the trailing degree sign and cast to int
    item = int(item.contents[1].get_text(strip=True)[:-1])
    print(item)
    stuff.append(item)


df = pd.DataFrame(stuff)
df.columns = ['temp']
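
One caveat with that parse: the [:-1] slice assumes the temperature text always ends in a degree sign. A slightly more defensive sketch along the same lines (not from the original answer) pulls the digits out with a regex before building the dataframe:

import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')

temps = []
for item in soup.select("div.hour-card__mobile__cond"):
    text = item.contents[1].get_text(strip=True)
    # extract the leading (possibly negative) integer, e.g. "51" from "51°"
    match = re.search(r"-?\d+", text)
    if match:
        temps.append(int(match.group()))

df = pd.DataFrame({'temp': temps})
print(df)

If a card is missing a temperature, it is simply skipped instead of raising a ValueError.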

Solution

The values on the website are loaded dynamically via JavaScript once the page loads, so you can use requests-html or selenium.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")


# find_elements_by_css_selector works with Selenium 3.x; in Selenium 4 use
# driver.find_elements(By.CSS_SELECTOR, "div.temp.ng-binding") after
# from selenium.webdriver.common.by import By
data = driver.find_elements_by_css_selector("div.temp.ng-binding")

for item in data:
    print(item.text)

driver.quit()

Output:

51°
52°
53°
54°
53°
53°
52°
51°
51°
50°
50°
49°
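
The answer mentions requests-html as an alternative to Selenium; a minimal sketch of that route, assuming the same div.temp.ng-binding elements are present once the page's JavaScript has run (render() downloads a Chromium build on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
# run the page's JavaScript in a headless browser before querying the DOM
r.html.render()

for item in r.html.find("div.temp.ng-binding"):
    print(item.text)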

Updated per user request:

import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.select("div.hour-card__mobile__cond"):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    print(item, type(item))

Output:

51 <class 'int'>
52 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
52 <class 'int'>
51 <class 'int'>
51 <class 'int'>
50 <class 'int'>
50 <class 'int'>
50 <class 'int'>
