Scrape dynamic content created by JavaScript using Python


Problem description

I want to scrape DIV content created by a JavaScript function using a Python script. I tried BS4 (BeautifulSoup 4), but it does not return the dynamic data; it shows only the static page source.

Sample code:

import requests
from bs4 import BeautifulSoup

URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')


for row in soup.find_all('div', attrs={'class': 'quote'}):
    print(row)


print(soup.prettify())

Sample HTML source code is in Pastebin

Sample data to be extracted:

Solution

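The initial HTML does not contain the data you want to scrape, because that data is inserted by JavaScript after the page loads. That is why BeautifulSoup alone is not enough: it parses the markup but never runs the scripts. You can load the page with Selenium, let the browser execute the JavaScript, and then scrape the rendered content.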
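To see why a parser alone comes back empty, here is a minimal sketch. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs. The page in STATIC_HTML is hypothetical (it is not the markup from the question), and DivTextCollector and target_text are illustrative names. The target div is empty in the markup and would only be filled once a browser executes the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the div is empty in the markup and would only be
# filled in by the script once a real browser runs it.
STATIC_HTML = """
<html><body>
  <div id="dataTarget"></div>
  <script>
    document.getElementById('dataTarget').innerHTML = '<div>[1, 2, 3]</div>';
  </script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div id="dataTarget">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the target div
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # Enter the target div, or go one level deeper if already inside it.
            if self.depth or dict(attrs).get('id') == 'dataTarget':
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def target_text(html):
    """Return the stripped text inside the target div of the given markup."""
    parser = DivTextCollector()
    parser.feed(html)
    return ''.join(parser.chunks).strip()


# The parser never executes the script, so the div stays empty.
print(repr(target_text(STATIC_HTML)))
```

Feeding the same function markup in which the div is already populated, which is what a browser's rendered page source would contain, returns the text as expected.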

Code:

import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for button to be enabled
    WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    # find_element_by_id is gone from recent Selenium 4 releases; use find_element with By.ID
    button = browser.find_element(By.ID, 'getData')
    button.click()

    # wait for data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)

    import pprint
    pprint.pprint(data)

Output:

[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
  ...
]
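If only the raw values are needed, the nested structure above can be flattened after parsing. The following is a small sketch, assuming every cell is a dict with the 'value' and 'formattedValue' keys shown in the output. The helper name flatten_rows is illustrative, and the sample holds just the first row copied from above.

```python
# First row copied from the output above; the real result has more rows.
data = [
    [
        {'formattedValue': 'Atlantic', 'value': 'Atlantic'},
        {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
        {'formattedValue': 'ALEX', 'value': 'ALEX'},
        {'formattedValue': '16.70000', 'value': '16.7'},
        {'formattedValue': '-84.40000', 'value': '-84.4'},
        {'formattedValue': '30', 'value': '30'},
    ],
]


def flatten_rows(rows, key='value'):
    """Return one plain list per row, keeping only the chosen key of each cell."""
    return [[cell[key] for cell in row] for row in rows]


rows = flatten_rows(data)
print(rows[0])
```

Passing key='formattedValue' keeps the display strings instead of the raw values.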

The code assumes that the button is initially disabled: <button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button> and that the data is loaded only when the button is clicked, not automatically. Therefore you need to delete this line from your page: setTimeout(function(){ getUnderlyingData(); }, 3000);.

You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.
