刮擦使用Python使用Javascript创建的动态内容 [英] Scrape Dynamic contents created by Javascript using Python

查看:98
本文介绍了刮擦使用Python使用Javascript创建的动态内容的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用python脚本抓取由javascript函数创建的DIV内容.我已经尝试使用BS4,并且这样做无法获得动态数据.而是仅显示源代码.

I want to scrap DIV content created by javascript function by using python script. I have tried with BS4 and by doing with that i'm not able to get dynamic data. instead it shows only the source code.

示例代码:

import requests
from bs4 import BeautifulSoup

URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')


for row in soup.findAll('div', attrs = {'class':'quote'}):
    print(row)


print(soup.prettify())

示例HTML源代码位于 Pastebin

Sample HTML source code is in Pastebin

要提取的样本数据:

推荐答案

初始HTML不包含您要抓取的数据,因此仅使用BeautifulSoup是不够的.您可以使用 Selenium 加载页面,然后抓取内容.

The initial HTML does not contain the data you want to scrape, that's why using only BeautifulSoup is not enough. You can load the page with Selenium and then scrape the content.

代码:

import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for button to be enabled
    WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    button = browser.find_element_by_id('getData')
    button.click()

    # wait for data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)

    import pprint
    pprint.pprint(data)

输出:

[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
  ...
]

该代码假定该按钮最初是禁用的:<button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>并且不会自动加载数据,但是由于单击了该按钮.因此,您需要删除以下行:setTimeout(function(){ getUnderlyingData(); }, 3000);.

The code assumes that the button is initially disabled: <button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button> and data is not loaded automatically, but due to the button being clicked. Therefore you need to delete this line: setTimeout(function(){ getUnderlyingData(); }, 3000);.

您可以在此处找到示例的可行演示: http://demo-tableau.bitballoon.com/.

You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.

这篇关于刮擦使用Python使用Javascript创建的动态内容的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆