Webscraping / Beautifulsoup / sometimes None-return?


Problem description


I am trying to scrape some information from a web page. On one page it works fine, but on another it does not: I only get None as the return value.

This code / web page works fine:

# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Collect every <div class="company"> found in the downloaded HTML
name_box = soup.find_all("div", attrs={"class": "company"})
print(name_box)

But with this code / web page I only get None as the return value:

# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup

URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# find() returns None when no matching element exists in the parsed HTML
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)

Why is that?

(At first I thought that, because of the number in the class name "companyName__99a4824b" on the second page, the class name changes dynamically. That is not the case: when I refresh the page, the class name stays the same.)

Solution

The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.

requests downloads only the HTML that the server sends first and does not execute any JavaScript, so BeautifulSoup parses the page exactly as it arrived. That HTML does not contain an element with the class companyName__99a4824b.

Only after the browser has run the page's scripts and the page has fully loaded does the HTML include the desired tag.
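The effect can be reproduced offline. The following is a minimal sketch, not Bloomberg's actual markup: RAW_HTML is a hypothetical stand-in for a page whose h1 is created by a script. Because BeautifulSoup never executes the script, find() returns None even though the class name appears in the downloaded text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server sends first.
# The <h1> would only exist after a browser has executed the script.
RAW_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    var h1 = document.createElement("h1");
    h1.className = "companyName__99a4824b";
    document.getElementById("root").appendChild(h1);
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# No <h1> element exists in the parsed tree, so find() returns None.
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)

# The class name is present only as plain text inside the script.
script_text = soup.find("script").string
print("companyName__99a4824b" in script_text)
```

A quick check along the same lines for a real page is to search page.text for the class name: if it is missing, or shows up only inside a script, the element is being added in the browser.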

If you want to scrape that data, you'll need to use something like Selenium, which drives a real browser and which you can instruct to wait until the desired element of the page is ready.
