Beautifulsoup not returning complete HTML of the page


Question

I have been digging around on this site for some time and am unable to find a solution to my issue. I'm fairly new to web scraping and am trying to simply extract some links from a web page using Beautiful Soup.

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)

At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is that the tag I am looking for is not in the output.

For example, using the built-in find() I can grab the following div class tag: class="l__grid js-page-layout".
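As a minimal sketch of that call, assuming the soup object built in the snippet above (the class name is the one quoted in the question):

# This div is part of the server-rendered HTML, so find() locates it
layout = soup.find("div", class_="l__grid js-page-layout")
print(layout is not None)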

However, what I'm actually looking for are the contents of a tag that is embedded at a lower level in the tree: js-event-list-tournament-events.

When I perform the same find operation on the lower-level tag, I get no results.
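For contrast, the same kind of lookup against the lower-level class (again assuming the soup from the question's code) comes back empty, because, as the answer below explains, that markup is only added by JavaScript after the page loads:

# The event list is rendered client-side, so it is absent from the fetched HTML
events = soup.find("div", class_="js-event-list-tournament-events")
print(events)  # None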

Using an Azure-based Jupyter Notebook, I have tried a number of the solutions to similar problems on Stack Overflow, with no luck.

Thanks! Kenny

Answer

The page uses JavaScript to load the data dynamically, so you have to use Selenium. Check the code below. Note that you have to install selenium and chromedriver (unzip the file and copy it into your Python folder).

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sofascore.com/pt/futebol/2018-09-18"

# Run Chrome headless so no browser window is opened
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)               # give the JavaScript time to render the event list
page = driver.page_source   # HTML after the scripts have run
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'js-event-list-tournament-events'})
print(container)
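If the fixed time.sleep(3) proves flaky, waiting explicitly for the target element before reading page_source is a more robust option; a small sketch using Selenium's WebDriverWait (the class name is the one from the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# In place of time.sleep(3) above: block for up to 10 seconds
# until the tournament-events container has been rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'js-event-list-tournament-events'))
)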

Alternatively, you can use their JSON API:

import requests

# JSON feed for the same date (URL kept exactly as given in the answer)
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
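The answer doesn't show the shape of the returned JSON; assuming the endpoint returns a JSON object (so r.json() is a dict), printing the top-level keys is a quick way to explore it before drilling down:

data = r.json()
print(list(data.keys()))  # inspect the top-level structure first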

