How can I scrape tooltip values from a Tableau graph embedded in a webpage
Question
I am trying to figure out whether, and how, I can scrape tooltip values from a Tableau graph embedded in a webpage using Python.
Here is an example of a graph that shows tooltips when the user hovers over the bars:
I grabbed this URL from the original webpage that I want to scrape:
https://covid19.colorado.gov/hospital-data
Any help is appreciated.
Recommended answer
Edit
I've made a Python library for scraping Tableau dashboards. The implementation is simpler:
from tableauscraper import TableauScraper as TS
url = "https://public.tableau.com/views/Colorado_COVID19_Data/CO_Home"
ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()
for t in dashboard.worksheets:
    # show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    # show dataframe for this worksheet
    print(t.data)
The graphic seems to be generated in JS from the result of an API call that looks like:
POST https://public.tableau.com/TITLE/bootstrapSession/sessions/SESSION_ID
The SESSION_ID parameter is located (among other things) in the tsConfigContainer textarea of the page served at the URL used to build the iframe.
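As a rough illustration of that lookup, the sketch below pulls the textarea out of an HTML string and assembles the POST URL from it. The sample HTML and its values are invented for the example, and the regular expression is only a standard-library stand-in for an HTML parser such as BeautifulSoup; the keys vizql_root and sessionid are the ones read by the answer's own code.

```python
import html
import json
import re


def bootstrap_url(page_html, host="https://public.tableau.com"):
    """Build the bootstrapSession URL from the tsConfigContainer textarea."""
    match = re.search(
        r'<textarea[^>]*\bid="tsConfigContainer"[^>]*>(.*?)</textarea>',
        page_html,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("tsConfigContainer textarea not found")
    # The textarea body is HTML-escaped JSON, so unescape before parsing.
    config = json.loads(html.unescape(match.group(1)))
    return f'{host}{config["vizql_root"]}/bootstrapSession/sessions/{config["sessionid"]}'


# Hypothetical page fragment; a real page carries many more keys.
sample_html = (
    '<textarea id="tsConfigContainer">'
    "{&quot;vizql_root&quot;: &quot;/vizql/w/Demo/v/Sheet&quot;, "
    "&quot;sessionid&quot;: &quot;ABC123&quot;, "
    "&quot;sheetId&quot;: &quot;Sheet&quot;}"
    "</textarea>"
)

print(bootstrap_url(sample_html))
```

Against a live page, the string passed in would be the body of the response for https://public.tableau.com/views/{urlPath}.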
Starting from https://covid19.colorado.gov/hospital-data:
- find the element with class tableauPlaceholder
- get its param element whose name attribute is name; its value gives you the URL https://public.tableau.com/views/{urlPath}
- that URL returns a page containing a textarea with id tsConfigContainer and a bunch of JSON values
- extract the session ID (sessionid) and the root path (vizql_root)
- make a POST to https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID with the sheetId as form data
- extract the JSON from the result (the result itself is not valid JSON)
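The last step can be tried in isolation. Judging by the regular expression used in the answer's code, the response appears to consist of two JSON documents, each preceded by a number and a semicolon. The sketch below is one possible alternative to a greedy regex: it lets json.JSONDecoder.raw_decode find where each document ends. The function name and the sample payload are made up for illustration, and the numeric prefixes are simply skipped rather than interpreted.

```python
import json


def split_bootstrap_response(text):
    """Return the JSON documents found in a '<number>;<json><number>;<json>' payload."""
    decoder = json.JSONDecoder()
    documents = []
    pos = 0
    while pos < len(text):
        # Skip the numeric prefix by jumping to the next semicolon.
        separator = text.find(";", pos)
        if separator == -1:
            break
        # raw_decode parses one JSON value and reports the index where it ended.
        document, pos = decoder.raw_decode(text, separator + 1)
        documents.append(document)
    return documents


# Hypothetical payload in the assumed format.
sample = '15;{"a": {"b": 1}}25;{"c": [1, 2], "d": "x;y"}'
info, data = split_bootstrap_response(sample)
print(info)
print(data)
```

Because the decoder tracks the end of each document, semicolons or braces inside string values do not confuse the split.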
Code:
import json
import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

# get the second tableau link
tableauContainer = soup.find_all("div", {"class": "tableauPlaceholder"})[1]
urlPath = tableauContainer.find("param", {"name": "name"})["value"]

# fetch the viz page that the iframe would load
r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params={
        ":showVizHome": "no",
    },
)
soup = BeautifulSoup(r.text, "html.parser")

# the tsConfigContainer textarea holds the session configuration as JSON
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

# the response holds two JSON documents, each prefixed with "<number>;"
dataReg = re.search(r"\d+;({.*})\d+;({.*})", r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
From there you have all the data. You will need to work out how the data is split, since it all seems to be dumped into a single list. Looking at the other fields in the JSON object would probably help with that.
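One possible first step in working that out is to group the dumped values by type. The sketch below assumes that each entry of the printed dataColumns list is a dict with dataType and dataValues keys; the answer does not confirm this, so print one entry to check before relying on it. The function name and the sample values are invented for illustration.

```python
from collections import defaultdict


def group_values_by_type(data_columns):
    """Collect the values of every column under its declared data type."""
    grouped = defaultdict(list)
    for column in data_columns:
        # Assumed keys: "dataType" and "dataValues" (not confirmed by the answer).
        grouped[column["dataType"]].extend(column["dataValues"])
    return dict(grouped)


# Hypothetical sample shaped like the assumed dataColumns list.
sample_columns = [
    {"dataType": "integer", "dataValues": [12, 34, 56]},
    {"dataType": "cstring", "dataValues": ["Hospitalized", "Discharged"]},
]

for data_type, values in group_values_by_type(sample_columns).items():
    print(data_type, len(values), values[:3])
```

Seeing how many values of each type exist can hint at which worksheet fields they belong to, but mapping values back to individual tooltips still requires inspecting the other fields of the JSON object.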