Scrape Data from Tableau Public dashboard

Question

I am very new to scraping data from websites and am at a loss on how to grab data from a website that uses Tableau Public.

Website: https://showmestrong.mo.gov/data/public-health/

I've been reading several sources on how to inspect the page elements and find the table within them, but I am at a loss. I've tried using requests and BeautifulSoup in Python but don't know how to get past this point:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://showmestrong.mo.gov/data/public-health/")
soup = BeautifulSoup(r.text, "html.parser")

and it doesn't seem to show any tables, for example about cases and deaths.

Any tips or documentation/forums about this would be appreciated!

Recommended answer

The tableau.js library seems to load another URL from which it gets the data (the public.tableau.com view requested in the code below).

From there, it's very similar to this answer and this one, where you extract a JSON configuration from a textarea tag. Extract the sessionid from that configuration to build the URL that returns the data:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://public.tableau.com/views/COVID-19inMissouri/COVID-19inMissouri", 
    params = {
    ":embed": "y",
    ":showVizHome": "no",
    ":host_url": "https://public.tableau.com/",
    ":embed_code_version": 3,
    ":tabs": "no",
    ":toolbar": "no",
    ":animate_transition": "yes",
    ":display_static_image": "no",
    ":display_spinner": "no",
    ":display_overlay": "yes",
    ":display_count": "yes",
    ":language": "en",
    ":loadOrderID": 0
})
soup = BeautifulSoup(r.text, "html.parser")

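# The embed page stores its JSON configuration in a <textarea id="tsConfigContainer">.
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

# Build the bootstrap URL from the configuration's vizql_root and sessionid.
dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
# The response holds two JSON documents, each preceded by a number and a semicolon.
# A raw string keeps the \d escapes intact for the regex engine.
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

# Print the data columns of the first data segment.
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])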

The response is not plain JSON: it contains two JSON documents, each prefixed by a number and a semicolon. It therefore has to be parsed with a regex to extract the JSON from it, as shown in the code above.
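As an alternative to the regex, the same number-prefixed layout can be split with the standard library's JSON decoder. This is only a sketch: the function name split_tableau_response and the sample string are made up for illustration, and it assumes the response looks like number;{json}number;{json} with nothing else in between. The numeric prefix is only checked, not used.

```python
import json


def split_tableau_response(text):
    """Parse '<number>;{json}<number>;{json}...' into a list of objects.

    Hypothetical helper: assumes every JSON document directly follows a
    numeric prefix and a semicolon. The prefix value itself is ignored.
    """
    decoder = json.JSONDecoder()
    chunks = []
    pos = 0
    while pos < len(text):
        sep = text.find(";", pos)
        if sep == -1:
            break
        prefix = text[pos:sep]
        if not prefix.strip().isdigit():
            raise ValueError(f"expected a numeric prefix, got {prefix!r}")
        # raw_decode parses one JSON document starting at the given index
        # and returns the index just past its end.
        obj, pos = decoder.raw_decode(text, sep + 1)
        chunks.append(obj)
    return chunks


# Made-up sample in the assumed layout, not real Tableau output.
sample = '13;{"a": [1, 2]}19;{"b": {"c": "x;y"}}'
info, data = split_tableau_response(sample)
```

Because the decoder parses each document, a semicolon inside a JSON string does not break the split, and there is no dependence on greedy matching. With the real response you would pass r.text instead of sample.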

Run this on repl.it
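The final print in the answer's code outputs the dataColumns list. A possible next step is to group its values by type. The sketch below assumes each entry is a dict with dataType and dataValues keys. That shape is an assumption to verify against the printed output, and both the helper name columns_by_type and the sample values are invented for illustration.

```python
def columns_by_type(data_columns):
    """Map each dataType to the combined list of its dataValues.

    Assumes every column is a dict with "dataType" and "dataValues" keys.
    """
    result = {}
    for column in data_columns:
        result.setdefault(column["dataType"], []).extend(column["dataValues"])
    return result


# Invented sample in the assumed shape, not real dashboard data.
sample_columns = [
    {"dataType": "integer", "dataValues": [10, 25]},
    {"dataType": "cstring", "dataValues": ["Cases", "Deaths"]},
]
values = columns_by_type(sample_columns)
```

With the real response, you would pass in the list that the answer's code prints instead of sample_columns.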
