How to scrape page with BeautifulSoup? Page Source not matching Inspect Element


Problem Description

I'm trying to scrape a few things from this fantasy basketball page. I'm using BeautifulSoup in Python 3.5+ to do this.

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')

To begin with, I'd like to scrape the titles for the 9 categories into a Python list. So my list should look like categories = [FG%, FT%, 3PM, REB, AST, STL, BLK, TO, PTS].

What I was hoping to do was something like the following:

tableSubHead = soup.find_all('tr', class_='Table2__header-row')
tableSubHead = tableSubHead[0]
listCats = tableSubHead.find_all('th')
categories = []
for cat in listCats:
    if 'title' in cat.attrs:
        categories.append(cat.string)

However, soup.find_all('tr', class_='Table2__header-row') returns an empty list instead of the table row element I want. I suspect this is because when I view the page source, it's completely different from Inspect Element in Chrome Dev Tools. I understand this is because JavaScript changes the elements on the page dynamically, but I'm not sure what the solution would be.
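You can confirm the mismatch offline: BeautifulSoup only ever sees the markup in the HTTP response, so a class that JavaScript adds after load simply isn't in the parsed tree. A minimal sketch, with two made-up HTML fragments standing in for the page source and the inspected DOM:

```python
from bs4 import BeautifulSoup

# What requests receives: the server-rendered shell, before any JavaScript runs.
# (Both fragments are invented for illustration; the real ESPN markup is larger.)
raw_html = '<html><body><div id="espn-fantasy"></div></body></html>'

# What Inspect Element shows: the DOM after JavaScript has built the table.
rendered_html = '''
<table>
  <tr class="Table2__header-row"><th title="Field Goal Percentage">FG%</th></tr>
</table>
'''

raw = BeautifulSoup(raw_html, 'html.parser')
rendered = BeautifulSoup(rendered_html, 'html.parser')

print(raw.find_all('tr', class_='Table2__header-row'))            # [] -- class never served
print(len(rendered.find_all('tr', class_='Table2__header-row')))  # 1
```

The selector is correct; it just has nothing to match until the JavaScript has run.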

Answer

The problem you're facing is that this website is a web app: JavaScript has to run to generate what you're seeing, and you can't run JavaScript with requests. Here's what I did to get the result with Selenium, which opens a headless browser and lets the JavaScript run first by waiting for a period of time:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
capa = DesiredCapabilities.CHROME
capa["pageLoadStrategy"] = "none"   # don't block until the page finishes loading
driver = webdriver.Chrome(chrome_options=options, desired_capabilities=capa)
driver.set_window_size(1440, 900)
driver.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
time.sleep(15)  # give the page's JavaScript time to render the tables

plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'lxml')

soup.select('.Table2__header-row') # Returns full results.

len(soup.select('.Table2__header-row')) # 8
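Once the soup is built from the rendered page, the category-extraction loop from the question works as intended. A sketch against a made-up header row (the real page's `th` markup and `title` attributes may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the page has rendered (markup is illustrative).
plain_text = '''
<table>
  <tr class="Table2__header-row">
    <th>Rank</th>
    <th title="Field Goal Percentage">FG%</th>
    <th title="Free Throw Percentage">FT%</th>
    <th title="Points">PTS</th>
  </tr>
</table>
'''
soup = BeautifulSoup(plain_text, 'html.parser')

tableSubHead = soup.find_all('tr', class_='Table2__header-row')[0]
categories = []
for cat in tableSubHead.find_all('th'):
    if 'title' in cat.attrs:  # skip cells without a title, e.g. the rank column
        categories.append(cat.string)

print(categories)  # ['FG%', 'FT%', 'PTS']
```

The `'title' in cat.attrs` guard is what filters out non-category header cells.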

This approach lets you scrape websites that are designed as web apps, and greatly expands what you can do: you can even send commands such as scrolling or clicking to load more content on the fly.

Install Selenium with pip install selenium. Selenium also lets you drive Firefox if you prefer that browser.
