How can I scrape tooltip values from a Tableau graph embedded in a webpage


Question

I am trying to figure out whether, and how, I can scrape tooltip values from a Tableau graph embedded in a webpage using Python.

Here is an example of a graph with tooltips that appear when the user hovers over the bars:

https://public.tableau.com/views/NumberofCOVID-19patientsadmittedordischarged/DASHPublicpage_patientsdischarges?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F&:embed_code_version=3&:tabs=no&:toolbar=yes&:animate_transition=yes&:display_static_image=no&:display_spinner=no&:display_overlay=yes&:display_count=yes&ID:loadOder&=1

I grabbed this URL from the original webpage that I want to scrape:

https://covid19.colorado.gov/hospital-data

Any help is appreciated.

Answer

Edit

I've made a Python library for scraping Tableau dashboards. The implementation is simpler:

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/Colorado_COVID19_Data/CO_Home"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

for t in dashboard.worksheets:
    #show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    #show dataframe for this worksheet
    print(t.data)

Run this on repl.it
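
If you only need a single sheet, a minimal follow-up sketch along the same lines could pick one worksheet out of the loop and export it with pandas. This assumes t.data is a pandas DataFrame (as the comments above suggest), and the worksheet name used here is purely hypothetical, so check the names printed by the loop first:

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/Colorado_COVID19_Data/CO_Home"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

# "Patients Admitted" is a hypothetical worksheet name: print t.name in the
# loop above to discover the real names for this workbook
target = [t for t in dashboard.worksheets if t.name == "Patients Admitted"]

if target:
    df = target[0].data  # pandas DataFrame with the worksheet values
    df.to_csv("patients_admitted.csv", index=False)
    print(df.head())
else:
    print("worksheet not found")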

The graphic seems to be generated in JS from the result of an API call that looks like:

POST https://public.tableau.com/TITLE/bootstrapSession/sessions/SESSION_ID 

The SESSION_ID parameter is located (among other things) in a textarea with id tsConfigContainer on the page at the URL used to build the iframe.

Starting from https://covid19.colorado.gov/hospital-data:

  • check the element with class tableauPlaceholder
  • get the param element with attribute name
  • it gives you the URL: https://public.tableau.com/views/{urlPath}
  • the previous link gives you a textarea with id tsConfigContainer containing a bunch of JSON values
  • extract the session_id and root path (vizql_root)
  • make a POST to https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID with the sheetId as form data
  • extract the JSON from the result (the result is not pure JSON)

Code:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

# get the second tableau link
tableauContainer = soup.findAll("div", { "class": "tableauPlaceholder"})[1]
urlPath = tableauContainer.find("param", { "name": "name"})["value"]

r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

# the response is not pure JSON: it is made of length-prefixed chunks like <size>;{...}<size>;{...}
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

From there you have all the data. You will still need to work out how the data is split, since it all seems to be dumped into a single list; looking at the other fields in the JSON object is probably useful for that.
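
As a starting point for that exploration, here is a small sketch, continuing from the snippet above, that summarises what each dumped column holds. It assumes each dataColumns entry carries "dataType" and "dataValues" keys, which the output above does not guarantee, so treat it as a debugging aid rather than a final parser:

# continues from the code above: `data` is the second JSON chunk
# assumption: every dataColumns entry has "dataType" and "dataValues" keys
columns = (
    data["secondaryInfo"]["presModelMap"]["dataDictionary"]
        ["presModelHolder"]["genDataDictionaryPresModel"]
        ["dataSegments"]["0"]["dataColumns"]
)

for col in columns:
    dtype = col.get("dataType", "?")
    values = col.get("dataValues", [])
    print(f"{dtype}: {len(values)} values, sample: {values[:5]}")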
