Web scraping / BeautifulSoup / sometimes a None return?
I am trying to scrape some information from a web page. On one page it works fine, but on another it does not: I only get None as the return value.
This code / web page works fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find_all("div", attrs={"class": "company"})
print(name_box)
But with this code / web page I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)
Why is that?
(At first I thought that, because of the number in the class name "companyName__99a4824b" on the second page, the class name changes dynamically. That is not the case: when I refresh the page, the class name stays the same.)
The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page. requests.get only downloads the HTML that the server sends initially, and BeautifulSoup parses exactly that HTML without running any scripts, so the result does not contain the tag with the companyName__99a4824b class. Only after a browser has fully loaded the page does the HTML include the desired tag.
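The effect can be reproduced offline with a small sketch. Both HTML strings below are made-up stand-ins (not Bloomberg's real markup), and find_company_name is a hypothetical helper: the first string mimics what the server sends before any script runs, the second mimics the DOM after a browser has executed the script.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server sends: the <h1> does not
# exist yet, only a script that would create it inside a browser.
INITIAL_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    var h = document.createElement("h1");
    h.className = "companyName__99a4824b";
    h.textContent = "Example Index";
    document.getElementById("root").appendChild(h);
  </script>
</body></html>
"""

# Hypothetical stand-in for the page after a browser has run the script.
RENDERED_HTML = """
<html><body>
  <div id="root"><h1 class="companyName__99a4824b">Example Index</h1></div>
</body></html>
"""


def find_company_name(html):
    """Parse the given HTML and look up the <h1> from the question."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("h1", attrs={"class": "companyName__99a4824b"})


print(find_company_name(INITIAL_HTML))   # the script is never executed
print(find_company_name(RENDERED_HTML))  # the tag is present in the markup
```

The first lookup finds nothing because the parser treats the script body as plain text, which is presumably the situation the question runs into with the downloaded page.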
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.