Scrape Dynamic contents created by Javascript using Python
I want to scrape DIV content that is created by a JavaScript function, using a Python script. I have tried BS4, but with it I am not able to get the dynamic data; instead it shows only the source code.
Sample code:
import requests
from bs4 import BeautifulSoup
URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
for row in soup.findAll('div', attrs = {'class':'quote'}):
    print(row)
print(soup.prettify())
Sample HTML source code is in Pastebin
Sample data to be extracted:
The initial HTML does not contain the data you want to scrape, which is why BeautifulSoup alone is not enough. You can load the page with Selenium and then scrape the content.
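A quick way to confirm this for a given page is to check whether the target element already has text in the raw HTML. The sketch below is only an illustration: `STATIC_HTML` is a hypothetical stand-in for what `requests.get(url).text` would return, and `needs_browser` is a helper name made up for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server returns: the container exists,
# but it stays empty until JavaScript runs in a real browser.
STATIC_HTML = """
<html><body>
  <button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>
  <div id="dataTarget"></div>
</body></html>
"""


def needs_browser(html, element_id):
    """Return True if the element is missing or has no text in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find(id=element_id)
    return target is None or not target.get_text(strip=True)


print(needs_browser(STATIC_HTML, "dataTarget"))
```

If the check returns True, the data is being filled in client-side, and a browser-driven approach such as Selenium is the usual next step.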
Code:
import json
import pprint

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for the button to be enabled; until() returns the located element
    button = WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    button.click()

    # wait for the data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    # only reached if both waits succeeded
    html = browser.page_source
finally:
    browser.quit()

if html:
    # the rendered page now contains the JSON text inside the target div
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)
    pprint.pprint(data)
Output:
[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
 ...
]
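Each row in this output is a list of cells, and each cell carries a `value` and a `formattedValue`. As a rough sketch of what you might do next, the snippet below reduces such rows to plain lists. The sample input is just the first row from the output above, and `row_values` is a helper name invented for this illustration; the column names are not part of the output, so none are assumed here.

```python
# First row of the scraped output, reused here as sample input.
data = [
    [
        {'formattedValue': 'Atlantic', 'value': 'Atlantic'},
        {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
        {'formattedValue': 'ALEX', 'value': 'ALEX'},
        {'formattedValue': '16.70000', 'value': '16.7'},
        {'formattedValue': '-84.40000', 'value': '-84.4'},
        {'formattedValue': '30', 'value': '30'},
    ],
]


def row_values(rows, key='value'):
    """Reduce each row of cell dicts to a plain list of the chosen field."""
    return [[cell[key] for cell in row] for row in rows]


for row in row_values(data):
    print(row)
```

Passing `key='formattedValue'` instead returns the display strings, which may be preferable if you want the values exactly as the page shows them.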
The Selenium code above assumes that the button is initially disabled:
<button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>
and that the data is loaded only when the button is clicked, not automatically. Therefore you need to delete this line:
setTimeout(function(){ getUnderlyingData(); }, 3000);
You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.