BeautifulSoup not returning complete HTML of the page

Question

I have been digging on the site for some time and I'm unable to find a solution to my issue. I'm fairly new to web scraping and am simply trying to extract some links from a web page using Beautiful Soup.

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)

At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is that the tag I am looking for is not in the output.

For example, using the built-in find(), I can grab the following div by its class: class="l__grid js-page-layout"

However, what I'm actually looking for is the contents of a tag that is embedded at a lower level in the tree:
js-event-list-tournament-events

When I perform the same find operation on the lower-level tag, I get no results.

I am using an Azure-based Jupyter Notebook. I have tried a number of the solutions to similar problems on Stack Overflow, with no luck.

Thanks! Kenny

Recommended answer

The page uses JavaScript to load its data dynamically, so the HTML that urlopen returns does not yet contain the tag you are looking for. You need a tool that runs the page's JavaScript first, such as Selenium. Check the code below. Note that you have to install selenium and ChromeDriver (unzip the file and copy it into your Python folder).

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
options.add_argument('--disable-gpu')
# Newer Selenium releases expect `options=`; `chrome_options=` is the deprecated spelling.
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # give the page's JavaScript time to load the data
page = driver.page_source  # HTML after the scripts have run
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
    'class':'js-event-list-tournament-events'})
print(container)
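
The fixed three-second sleep above is a guess: it may be too short on a slow connection and wasteful on a fast one. As a rough sketch (not part of the original answer), the helper below polls for the target class instead, using only the standard library. The names `html_has_class` and `wait_for_class` are hypothetical, and `get_html` is assumed to be any zero-argument callable that returns the current HTML as a string.

```python
import time
from html.parser import HTMLParser


class _ClassFinder(HTMLParser):
    """Records whether any start tag carries the target CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated classes.
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def html_has_class(html, class_name):
    """Return True if some element in `html` has `class_name` among its classes."""
    finder = _ClassFinder(class_name)
    finder.feed(html)
    return finder.found


def wait_for_class(get_html, class_name, timeout=10.0, interval=0.5):
    """Poll `get_html()` until the class appears, then return that HTML.

    Raises TimeoutError if the class has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if html_has_class(html, class_name):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"class {class_name!r} did not appear in {timeout}s")
        time.sleep(interval)
```

With the driver from the answer, `page = wait_for_class(lambda: driver.page_source, 'js-event-list-tournament-events')` could stand in for the `time.sleep(3)` and `driver.page_source` lines. Calling `html_has_class` on the decoded output of `urlopen(url).read()` is also a quick way to confirm that the tag really is missing from the raw HTML. Selenium ships its own explicit-wait support (`WebDriverWait`), which is usually the better choice once Selenium is already a dependency.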

Or you can use their JSON API:

import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
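
The answer does not show what the JSON looks like, and printing the whole response can be overwhelming. As a small sketch under that assumption (the schema is unknown here, and `describe` is a hypothetical helper, not part of requests), the function below summarizes the shape of any decoded JSON value so you can find the keys you need.

```python
def describe(value, depth=2):
    """Return a nested summary of a decoded JSON value, down to `depth` levels.

    Dicts keep their keys, lists are represented by their first item plus a
    count, and scalar values are replaced by their type name.
    """
    if isinstance(value, dict):
        if depth == 0:
            return f"dict({len(value)} keys)"
        return {key: describe(item, depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        if depth == 0 or not value:
            return f"list({len(value)} items)"
        return [describe(value[0], depth - 1), f"... {len(value)} items total"]
    return type(value).__name__
```

For example, `print(describe(r.json()))` gives an outline of the top two levels of the response; raise `depth` to look further down once you know which branch holds the events.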
