How to scrape an HTML table only after its data loads using Python Requests?


Question

I am trying to learn data scraping with Python and have been using the Requests and BeautifulSoup4 libraries. This works well for normal websites. But when I tried to get data from a website where the table data loads after some delay, I found that I get an empty table. An example would be this webpage (the URL requested in the script below).

The script I've tried is a fairly routine one:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2")
soup = BeautifulSoup(response.text, "html.parser")

content = soup.find('div', {'id': 'odds-data-portal'})

The data loads into the odds-data-portal table on the page, but the code doesn't give me that. How can I make sure the table has been loaded with data before I fetch it?

Recommended answer

You will need to use something like Selenium to get the HTML, because Requests only downloads the initial page and does not run the JavaScript that fills in the table. You can still use BeautifulSoup to parse the result, as follows:

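from operator import itemgetter

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2"
browser = webdriver.Firefox()

try:
    # Let a real browser run the page's JavaScript, then grab the rendered HTML.
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
finally:
    browser.quit()  # close the browser even if loading fails

data_table = soup.find('div', {'id': 'odds-data-table'})

# Each row of odds sits in a div.table-container that follows the data table.
for div in data_table.find_all_next('div', class_='table-container'):
    row = div.find_all(['span', 'strong'])

    # Skip containers that do not hold all five cells used below.
    if len(row) >= 5:
        # Reorder the cells and print them as one comma-separated line.
        print(','.join(cell.get_text(strip=True) for cell in itemgetter(0, 4, 3, 2, 1)(row)))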

This will display:

Over/Under +0.5,(8),1.04,11.91,95.5%
Over/Under +0.75,(1),1.04,10.00,94.2%
Over/Under +1,(1),1.04,11.00,95.0%
Over/Under +1.25,(2),1.13,5.88,94.8%
Over/Under +1.5,(9),1.21,4.31,94.7%
Over/Under +1.75,(2),1.25,3.93,94.8%
Over/Under +2,(2),1.31,3.58,95.9%
Over/Under +2.25,(4),1.52,2.59,95.7%   
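One caveat with the script above: browser.get() returns once the page itself has loaded, which does not guarantee that the page's scripts have already filled in the odds rows. A minimal sketch of an explicit wait follows, built on Selenium's WebDriverWait, which accepts any callable that takes the driver. The helper names rows_loaded and page_source_when_loaded, the CSS selector and the 15-second timeout are illustrative assumptions, not part of the original answer, and the selector may need adjusting if the site's markup differs.

```python
CSS = "css selector"  # same string as selenium.webdriver.common.by.By.CSS_SELECTOR


def rows_loaded(css_selector, minimum=1):
    """Build a wait condition that holds once `minimum` matching elements exist.

    Hypothetical helper: the name and the `minimum` parameter are assumptions.
    """
    def condition(driver):
        rows = driver.find_elements(CSS, css_selector)
        # WebDriverWait keeps polling for as long as this returns a falsy value.
        return rows if len(rows) >= minimum else False
    return condition


def page_source_when_loaded(browser, url, css_selector, timeout=15):
    """Open `url` and return the rendered HTML once the rows have appeared."""
    # Imported here so rows_loaded() can be tried out without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(url)
    # Raises selenium.common.exceptions.TimeoutException if nothing matches
    # within `timeout` seconds.
    WebDriverWait(browser, timeout).until(rows_loaded(css_selector))
    return browser.page_source
```

In the script above this would stand in for the browser.get(url) call, for example page_source_when_loaded(browser, url, "div.table-container"), with the returned HTML passed to BeautifulSoup as before. Waiting for the rows rather than for the container matters here, because the question suggests the container is already present, but empty, in the initial HTML.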


Update: as suggested by @JRodDynamite, the headless PhantomJS browser can be used instead of Firefox. To do this:

  1. Download the PhantomJS Windows binary.
  2. Extract the phantomjs.exe executable and ensure it is in your PATH.
  3. Change the following line: browser = webdriver.PhantomJS()
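The steps above depend on PhantomJS, which is no longer maintained, and as far as I know Selenium 4 no longer provides webdriver.PhantomJS. A rough alternative sketch is to run Firefox itself in headless mode. The helper names below are hypothetical, and the code assumes a Selenium version whose webdriver.Firefox accepts an options argument and a Firefox driver that is available on the machine.

```python
def headless_firefox_options():
    """Return Firefox options that suppress the browser window."""
    # Imported lazily so this snippet can be loaded without Selenium installed.
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox's command-line flag for headless mode
    return options


def make_headless_firefox():
    """Start a headless Firefox session (hypothetical helper name)."""
    from selenium import webdriver

    return webdriver.Firefox(options=headless_firefox_options())
```

With this, the line to change becomes browser = make_headless_firefox(), and the rest of the script stays the same.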
