Scraping coronavirus-related data from a page which is dynamically loaded from a Tableau canvas (I think...)
Problem description
I will be more than happy to find out this question is a duplicate, but if so - I can't find that Q&A.
There is this mysterious page from the New York State Department of Health containing "Fatalities by County and Age Group". As the title implies, it contains two tables ("By County"/"By Age Group").
For some strange reason, the data on this page is super-secured. It can't be selected, the page can't be saved and it can't be printed. The data isn't in the page source. I also tried (and failed) to inspect the XHR calls for the data.
Obviously, requests and BeautifulSoup can't handle it. I tried the usual Selenium incantations (so, unless I'm told otherwise, I won't clutter this question with "what I tried" snippets).
Desired output: the data from those two tables, in any conceivable format.
The only thing I can think of is to take a screenshot and try to OCR the image...
I don't know if it's Selenium, Tableau, the NYS Dep't of Health or just me, but it's time to call in the heavy artillery...
Let me explain the scenario:
- The website generates a session id, exposed through the X-Session-Id header, as soon as you visit the main index page. So I requested that page (the text originally said via a GET request; the script below sends this first call with req.post) and picked the id up from the response headers.
- I found a POST request that is sent automatically before you reach your desired URL, and it uses the session id we collected before. Here it is: https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{session id}
- Now we can call your target, which is https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n
- Then I noticed another XHR request to the back-end API. Before making that call, we parse the HTML content to pick up the time object, which is responsible for having the API generate the data freshly, so we get up-to-the-moment data (think of it like a live chat). In our case it sits behind lastUpdatedAt inside the HTML.
- I also noticed that we need to pick up the most recent X-Session-Id, generated by our previous request (in the script below it is read from the response to the view page).
- Now we make the call, using the session we picked up, to https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{session}
Now we have received the full response. You can parse it or do whatever you want with it.
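The lastUpdatedAt step can be tried offline before touching the network. The sketch below is only an illustration under stated assumptions: the sample HTML fragment and the helper name build_sticky_session_key are made up here, and json.dumps is swapped in for the string template used in the script so that the nested quotes are escaped automatically.

```python
import json
import re

# Hypothetical fragment standing in for the view page's HTML; the real page
# embeds a much larger configuration blob.
SAMPLE_HTML = 'var cfg = {"workbookId":9,"lastUpdatedAt":1588888888888,"isAuthoring":false};'


def build_sticky_session_key(html, workbook_id=9):
    """Return the stickySessionKey JSON string for the given page HTML."""
    found = re.search(r"lastUpdatedAt.+?(\d+),", html)
    if found is None:
        raise ValueError("lastUpdatedAt not found in the page")
    return json.dumps(
        {
            # featureFlags is itself a JSON string nested inside the object.
            "featureFlags": '{"MetricsAuthoringBeta":false}',
            "isAuthoring": False,
            "isOfflineMode": False,
            "lastUpdatedAt": int(found.group(1)),
            "workbookId": workbook_id,
        },
        separators=(",", ":"),
    )


key = build_sticky_session_key(SAMPLE_HTML)
print(key)
```

Raising an error when lastUpdatedAt is missing makes a changed page layout fail loudly instead of posting a malformed key.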
import re

import requests

# Form fields sent along with the bootstrapSession request.
data = {
    'worksheetPortSize': '{"w":1536,"h":1250}',
    'dashboardPortSize': '{"w":1536,"h":1250}',
    'clientDimension': '{"w":1536,"h":349}',
    'renderMapsClientSide': 'true',
    'isBrowserRendering': 'true',
    'browserRenderingThreshold': '100',
    'formatDataValueLocally': 'false',
    'clientNum': '',
    'navType': 'Reload',
    'navSrc': 'Top',
    'devicePixelRatio': '2.5',
    'clientRenderPixelLimit': '25000000',
    'allowAutogenWorksheetPhoneLayouts': 'true',
    'sheet_id': 'NYSDOH%20COVID-19%20Tracker%20-%20Fatalities',
    'showParams': '{"checkpoint":false,"refresh":false,"refreshUnmodified":false}',
    'filterTileSize': '200',
    'locale': 'en_US',
    'language': 'en',
    'verboseMode': 'false',
    ':session_feature_flags': '{}',
    'keychain_version': '1'
}


def main(url):
    with requests.Session() as req:
        # Visit the site index to obtain the initial session id.
        r = req.post(url)
        sid = r.headers.get("X-Session-Id")
        # Replay the "clear" request that is sent before the view loads.
        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{sid}")
        # Load the target view; its HTML carries lastUpdatedAt.
        r = req.get(
            "https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n")
        match = re.search(r"lastUpdatedAt.+?(\d+),", r.text).group(1)
        # Raw string: without it Python drops the backslashes before the inner
        # quotes and the value is no longer valid JSON.
        time = r'{"featureFlags":"{\"MetricsAuthoringBeta\":false}","isAuthoring":false,"isOfflineMode":false,"lastUpdatedAt":xxx,"workbookId":9}'.replace(
            'xxx', match)
        data['stickySessionKey'] = time
        # Use the most recent session id, taken from the view response.
        nid = r.headers.get("X-Session-Id")
        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{nid}", data=data)
        print(r.text)


main("https://covid19tracker.health.ny.gov")
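For the "parse it" part, here is a minimal sketch. It assumes the response body is a sequence of JSON objects, each prefixed with a number and a semicolon, which is the shape Tableau bootstrapSession responses commonly take; check this against the actual r.text before relying on it. The function name split_chunks and the sample string are hypothetical and not part of the answer.

```python
import json


def split_chunks(body):
    """Decode every '<number>;{json}' chunk found in body, in order."""
    decoder = json.JSONDecoder()
    chunks = []
    pos = 0
    while pos < len(body):
        # Skip the numeric prefix and its ';' separator (assumed format).
        sep = body.find(";", pos)
        if sep == -1 or not body[pos:sep].strip().isdigit():
            break
        # raw_decode stops at the end of one JSON value and reports where,
        # so the numeric prefix itself is never trusted as a length.
        obj, pos = decoder.raw_decode(body, sep + 1)
        chunks.append(obj)
    return chunks


# Made-up sample in the assumed format, not real server output.
sample = '13;{"a": [1, 2]}19;{"b": {"c": "x;y"}}'
for chunk in split_chunks(sample):
    print(sorted(chunk))
```

With the script's response this would be called as split_chunks(r.text). Which chunk holds the table values is not stated in the answer, so inspect the keys of the decoded objects to find out.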