How can I scrape tooltip values from a Tableau graph embedded in a webpage

Question

I am trying to figure out whether, and how, I can scrape tooltip values from a Tableau graph embedded in a webpage using Python.

Here is an example of a graph that shows tooltips when the user hovers over the bars:

I grabbed this URL from the original webpage that I want to scrape:

https://covid19.colorado.gov/hospital-data

Thanks for your help.

Recommended answer

The graphic seems to be generated in JavaScript from the result of an API call that looks like:

POST https://public.tableau.com/TITLE/bootstrapSession/sessions/SESSION_ID 

The SESSION_ID parameter is found (among other values) in the tsConfigContainer textarea of the page served at the URL used to build the iframe.

Starting from https://covid19.colorado.gov/hospital-data:

  • find the element with class tableauPlaceholder
  • get its param element whose name attribute is "name"
  • its value gives you the URL: https://public.tableau.com/views/{urlPath}
  • that page contains a textarea with id tsConfigContainer holding a bunch of JSON values
  • extract the session ID (sessionid) and root path (vizql_root) from it
  • make a POST to https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID with the sheetId as form data
  • extract the JSON from the response (the response body as a whole is not valid JSON)
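
The last step above can also be done without a regular expression. The following is a minimal sketch, assuming the response body is a sequence of chunks of the form `<digits>;<json>` (the shape the answer's regex implies). The helper name `split_json_chunks` and the sample string are made up for illustration and are not real Tableau output. The digits look like a length prefix, but the sketch treats them as opaque and lets `json.JSONDecoder.raw_decode` find where each JSON value ends.

```python
import json
import re


def split_json_chunks(body: str) -> list:
    """Decode every '<digits>;<json>' chunk found in body, in order."""
    decoder = json.JSONDecoder()
    prefix = re.compile(r"\d+;")
    chunks = []
    pos = 0
    while pos < len(body):
        match = prefix.match(body, pos)
        if match is None:
            break  # no further prefix found; stop rather than guess
        # raw_decode returns the parsed value and the index where it ended
        obj, pos = decoder.raw_decode(body, match.end())
        chunks.append(obj)
    return chunks


# Made-up sample in the assumed shape (not real Tableau data).
sample = '17;{"a": {"b": "}"}}13;{"x": [1, 2]}'
info, data = split_json_chunks(sample)
print(info)
print(data)
```

Because the JSON parser decides where each value ends, braces inside string values do not affect where the chunks are split.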

Code:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

# get the second Tableau placeholder on the page
tableauContainer = soup.find_all("div", {"class": "tableauPlaceholder"})[1]
urlPath = tableauContainer.find("param", { "name": "name"})["value"]

r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

# the body is not plain JSON: it holds two chunks shaped like <digits>;{...}<digits>;{...}
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

From there you have all the data. You will still need to work out how the data is split, since it all appears to be dumped into a single list. Looking at the other fields in the JSON object would probably help with that.
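
As a starting point for splitting the data, here is a small sketch that groups the values by type. It assumes that each entry of the printed `dataColumns` list is a dict with a `dataType` key and a `dataValues` list. That structure is an assumption to check against the real output, and both the helper name `columns_by_type` and the sample values are made up for illustration.

```python
def columns_by_type(data_columns):
    """Map each column's dataType to its list of values."""
    return {col["dataType"]: col["dataValues"] for col in data_columns}


# Made-up stand-in for the dataColumns list (assumed structure, not real data).
sample_columns = [
    {"dataType": "integer", "dataValues": [120, 135, 150]},
    {"dataType": "cstring", "dataValues": ["2020-04-01", "2020-04-02"]},
]

values = columns_by_type(sample_columns)
print(values["integer"])
print(values["cstring"])
```

This only separates the values by type. Which values belong to which bar or tooltip would still have to be worked out from the other fields in the JSON object.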
