Scrape Data from Tableau Public dashboard
Question
I am very new to scraping data from websites and am at a loss on how to grab data from a website that uses Tableau Public.
Website: https://showmestrong.mo.gov/data/public-health/
I've been reading several sources on how to inspect the elements and find the table within the page, but I am at a loss. I've tried using requests and BeautifulSoup in Python but don't know how to work past that.
import requests
from bs4 import BeautifulSoup
import json
import re
r = requests.get("https://showmestrong.mo.gov/data/public-health/")
soup = BeautifulSoup(r.text, "html.parser")
But the result doesn't seem to contain any tables, for example about cases and deaths.
Any tips or documentation/forums about this would be appreciated!
Recommended answer
The tableau.js library seems to load another URL from which it gets the data; it is the public.tableau.com view URL, with its query parameters, requested in the code below.
From there, it's very similar to this answer and this one, where you would extract a JSON configuration from a textarea tag. Extract the sessionid to build the URL to get the data:
import json
import re

import requests
from bs4 import BeautifulSoup

# Load the embedded view directly from Tableau Public.
r = requests.get(
    "https://public.tableau.com/views/COVID-19inMissouri/COVID-19inMissouri",
    params={
        ":embed": "y",
        ":showVizHome": "no",
        ":host_url": "https://public.tableau.com/",
        ":embed_code_version": 3,
        ":tabs": "no",
        ":toolbar": "no",
        ":animate_transition": "yes",
        ":display_static_image": "no",
        ":display_spinner": "no",
        ":display_overlay": "yes",
        ":display_count": "yes",
        ":language": "en",
        ":loadOrderID": 0,
    },
)
soup = BeautifulSoup(r.text, "html.parser")

# The session configuration is stored as JSON inside a <textarea> tag.
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

# Build the data URL from the session id and request the sheet.
dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data={"sheet_id": tableauData["sheetId"]})

# The response is not plain JSON: pull the two JSON objects out with a regex.
# A raw string keeps \d from being treated as a string escape.
dataReg = re.search(r"\d+;({.*})\d+;({.*})", r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
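Once the dataColumns list is printed, a natural next step is to group its values so they are easier to inspect. The sketch below assumes each entry is a dict with a "dataType" key and a "dataValues" list; that shape is an assumption about the Tableau response and is not shown in the answer, so check it against the real output first. The helper name columns_by_type and the sample values are made up for illustration.

```python
def columns_by_type(data_columns):
    """Group the values of each column under its declared data type."""
    grouped = {}
    for column in data_columns:
        # Assumed keys: "dataType" (e.g. "integer") and "dataValues" (a list).
        grouped.setdefault(column["dataType"], []).extend(column["dataValues"])
    return grouped


# Made-up sample in the assumed shape; the real response will differ.
sample_columns = [
    {"dataType": "integer", "dataValues": [1, 2, 3]},
    {"dataType": "cstring", "dataValues": ["a", "b"]},
]

grouped = columns_by_type(sample_columns)
print(grouped["integer"])
print(grouped["cstring"])
```

With the real response, the list printed by the code above would be passed in place of sample_columns.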
The response is not plain JSON: as the regex suggests, it contains two JSON objects, each preceded by a number and a semicolon. It therefore has to be parsed with a regular expression to extract the JSON from it, as the re.search call in the code above does.
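If the greedy regex ever proves fragile, one possible alternative (not the answer's method) is to let the standard library's json.JSONDecoder.raw_decode find where each JSON object ends. This is a minimal sketch that assumes the response looks like number;{...}number;{...}, as the regex implies. The function name split_json_chunks and the sample string are hypothetical, and the numeric prefixes in the sample are placeholders that the parser ignores.

```python
import json


def split_json_chunks(text):
    """Return every JSON object that follows a ';' separator in text."""
    decoder = json.JSONDecoder()
    chunks = []
    pos = 0
    while True:
        # Each chunk is assumed to start right after "<digits>;".
        start = text.find(";{", pos)
        if start == -1:
            break
        # raw_decode parses one JSON value and reports where it ended.
        obj, end = decoder.raw_decode(text, start + 1)
        chunks.append(obj)
        pos = end
    return chunks


# Made-up sample; the prefixes 10 and 20 are placeholders, not real lengths.
sample = '10;{"a": "info"}20;{"b": {"c": [1, 2]}}'
chunks = split_json_chunks(sample)
print(chunks[0])
print(chunks[1])
```

With the real response, split_json_chunks(r.text) should yield the same two objects that the regex groups produce, under the format assumption stated above.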