BeautifulSoup can't find class that exists on webpage?


Question

I am trying to scrape the following webpage, https://www.scoreboard.com/uk/football/england/premier-league/, specifically the scheduled and finished results. I am therefore looking for elements with class "stage-finished" or "stage-scheduled". However, when I fetch the page and print what page_soup contains, those elements are not in it.

I found another SO question with an answer saying that this happens because the content is loaded via AJAX, and that I need to look at the XHR requests under the Network tab in Chrome DevTools to find the file that loads the necessary data. However, it doesn't seem to be there.

import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime

myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)

page_soup = soup(page.content, "html.parser")

scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])

The above code raises an IndexError, of course, because the scheduled list is empty.

My question is: how do I go about getting the data I'm looking for?

I copied the contents of the XHR responses into a text editor and searched for stage-finished and the other class names, but found nothing. Am I missing something easy here?
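One way to confirm that the classes are missing from the downloaded HTML, rather than being selected incorrectly, is to count matches per class and guard against the empty list before indexing. The following is only a sketch: `count_by_class` is a hypothetical helper, and `raw_html` is a made-up stand-in for what a JavaScript-rendered page typically returns to `requests`, not the actual response from the site.

```python
from bs4 import BeautifulSoup


def count_by_class(html, class_names):
    """Return {class_name: number of elements carrying that class} for the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {name: len(soup.find_all(class_=name)) for name in class_names}


# Made-up example: an empty container plus a script tag, which is what a
# JavaScript-rendered page often looks like before any script has run.
raw_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

counts = count_by_class(raw_html, ["stage-scheduled", "stage-finished", "stage-live"])
print(counts)

# Guard before indexing so an empty result does not raise IndexError.
if counts["stage-scheduled"] == 0:
    print("No .stage-scheduled elements in the downloaded HTML")
```

If every count is zero for the real `page.content`, the elements are most likely created in the browser after the initial HTML is delivered, so no parser setting will make them appear.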

Recommended answer

The page is rendered by JavaScript, so the HTML that requests downloads does not yet contain those elements. You need something that actually runs the page's scripts, such as Selenium. Here is some code to start with:

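from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.scoreboard.com/uk/football/england/premier-league/'

driver = webdriver.Chrome()
driver.get(url)
# find_elements_by_class_name was removed in Selenium 4.3; use find_elements with By
stages = driver.find_elements(By.CLASS_NAME, 'stage-scheduled')
driver.quit()  # end the session and close the browser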

Or you could pass driver.page_source to the BeautifulSoup constructor (do this before closing the driver). Like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
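Because the elements are created by scripts, they may not exist yet at the moment `driver.get` returns. A hedged sketch of how the two steps could be combined with an explicit wait is shown below. The helper names `fetch_rendered_html` and `texts_by_class`, the 10-second timeout, and the sample markup are assumptions made for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, class_name, timeout=10):
    """Load `url` in Chrome, wait until `class_name` appears, return the rendered HTML."""
    # Selenium is imported here so the parsing helper below can be used without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Raises Selenium's TimeoutException if no element with the class
        # shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, class_name))
        )
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even on error


def texts_by_class(html, class_name):
    """Return the whitespace-joined text of every element carrying `class_name`."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(" ", strip=True) for el in soup.find_all(class_=class_name)]


# Made-up markup standing in for rendered output; not copied from the site.
sample = (
    '<div class="stage-scheduled"><span>Arsenal</span><span>Chelsea</span></div>'
    '<div class="stage-finished"><span>Everton</span></div>'
)
print(texts_by_class(sample, "stage-scheduled"))
```

With a browser driver available, usage would look like `html = fetch_rendered_html(url, "stage-scheduled")` followed by `texts_by_class(html, "stage-scheduled")`. Returning the page source as a string before quitting avoids holding on to Selenium element objects that become unusable once the browser is closed.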

Note: You need to install a browser driver (webdriver) first. I installed chromedriver.

Good luck!
