Scraping coronavirus-related data from a page which is dynamically loaded from a Tableau canvas (I think...)

Problem description

I will be more than happy to find out this question is a duplicate, but if so - I can't find that Q&A.

There is this mysterious page from the New York State Department of Health containing "Fatalities by County and Age Group". As the title implies, it contains two tables ("By County"/"By Age Group").

For some strange reason, the data on this page is super-secured. It can't be selected, the page can't be saved and it can't be printed. The data isn't in the page source. I also tried (and failed) to inspect the XHR calls for the data.

Obviously, requests and beautifulsoup can't handle it. I tried the usual Selenium incantations (so, unless I'm told otherwise, I won't clutter this question with "what I tried" snippets).

Desired output: the data from those two tables, in any conceivable format.

The only thing I can think of is to take a screenshot and try to OCR the image...

I don't know if it's Selenium, Tableau, the NYS Dep't of Health or just me, but it's time to call in the heavy artillery...

Solution

Let me explain the scenario:

  1. The website generates a session id, exposed as the X-Session-Id response header, once you visit the main index page. So I called it via a GET request and picked the id up from the response headers.
  2. I found a POST request that is sent automatically before you reach your desired URL, and it uses the session id we collected before. Here it is: https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{session id}

  3. Now we can call your target, which is: https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n

  4. Now I noticed another XHR request to the back-end API. But before making that call, we parse the HTML content to pick up the time object, which is responsible for generating fresh data from the API, so we get up-to-the-moment data (think of it like a live chat). In our case it sits behind lastUpdatedAt inside the HTML.

  5. I also noticed that we need to pick up the most recent X-Session-Id generated from our previous POST request.

  6. Now we make the call, using the session id we picked up, to https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{session}

Now we have received the full response. You can parse it or do whatever you want with it.
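Step 4 can be tried offline before running anything against the site. This is a minimal sketch, assuming the embed page contains a fragment like `"lastUpdatedAt":1589000000000,`; the sample HTML and the helper name `build_sticky_session_key` are made up for illustration, while the field names are the ones used in this answer's script. Building the value with `json.dumps` keeps the nested quotes escaped without hand-written backslashes.

```python
import json
import re


def build_sticky_session_key(html: str) -> str:
    """Build the stickySessionKey form value from the embed page HTML."""
    # Same pattern as the answer's script: the first run of digits
    # that follows "lastUpdatedAt" and ends with a comma.
    found = re.search(r"lastUpdatedAt.+?(\d+),", html)
    if found is None:
        raise ValueError("lastUpdatedAt not found in the page")
    key = {
        # featureFlags is itself a JSON document stored as a string.
        "featureFlags": json.dumps(
            {"MetricsAuthoringBeta": False}, separators=(",", ":")
        ),
        "isAuthoring": False,
        "isOfflineMode": False,
        "lastUpdatedAt": int(found.group(1)),
        "workbookId": 9,
    }
    return json.dumps(key, separators=(",", ":"))


# Hypothetical fragment standing in for the real page source.
sample_html = '<script>{"lastUpdatedAt":1589000000000,"workbookId":9}</script>'
print(build_sticky_session_key(sample_html))
```

The returned string is what would go into `data['stickySessionKey']` before the final POST.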

import requests
import re


data = {
    'worksheetPortSize': '{"w":1536,"h":1250}',
    'dashboardPortSize': '{"w":1536,"h":1250}',
    'clientDimension': '{"w":1536,"h":349}',
    'renderMapsClientSide': 'true',
    'isBrowserRendering': 'true',
    'browserRenderingThreshold': '100',
    'formatDataValueLocally': 'false',
    'clientNum': '',
    'navType': 'Reload',
    'navSrc': 'Top',
    'devicePixelRatio': '2.5',
    'clientRenderPixelLimit': '25000000',
    'allowAutogenWorksheetPhoneLayouts': 'true',
    'sheet_id': 'NYSDOH%20COVID-19%20Tracker%20-%20Fatalities',
    'showParams': '{"checkpoint":false,"refresh":false,"refreshUnmodified":false}',
    'filterTileSize': '200',
    'locale': 'en_US',
    'language': 'en',
    'verboseMode': 'false',
    ':session_feature_flags': '{}',
    'keychain_version': '1'
}


def main(url):
    with requests.Session() as req:
        # Step 1: request the index page to obtain the initial X-Session-Id.
        r = req.post(url)
        sid = r.headers.get("X-Session-Id")

        # Step 2: hit the "clear" endpoint with that session id.
        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{sid}")

        # Step 3: load the embedded view.
        r = req.get(
            "https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n")

        # Step 4: pull the lastUpdatedAt timestamp out of the HTML.
        match = re.search(r"lastUpdatedAt.+?(\d+),", r.text).group(1)

        # Raw string, so the backslashes escaping the inner quotes are kept.
        time = r'{"featureFlags":"{\"MetricsAuthoringBeta\":false}","isAuthoring":false,"isOfflineMode":false,"lastUpdatedAt":xxx,"workbookId":9}'.replace(
            'xxx', match)

        data['stickySessionKey'] = time
        # Step 5: pick up the most recent X-Session-Id.
        nid = r.headers.get("X-Session-Id")

        # Step 6: bootstrap the session; the response body carries the data.
        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{nid}", data=data)

        print(r.text)


main("https://covid19tracker.health.ny.gov")
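As for parsing what comes back: bootstrapSession bodies are commonly reported to arrive as several JSON documents back to back, each preceded by a number and a semicolon (`<length>;{...}<length>;{...}`). That layout is an assumption here, not something this answer states, so check it against `r.text` first. Under that assumption, a minimal sketch for splitting the body into Python objects could look like this; `split_chunks` and the sample string are hypothetical, and the numeric prefixes are skipped rather than trusted.

```python
import json
import re

# A numeric prefix followed by a semicolon, e.g. "1234;".
_PREFIX = re.compile(r"\s*\d+;")


def split_chunks(body: str) -> list:
    """Split a '<n>;{json}<n>;{json}...' body into decoded JSON values."""
    decoder = json.JSONDecoder()
    chunks = []
    pos = 0
    while pos < len(body):
        prefix = _PREFIX.match(body, pos)
        if prefix is None:
            break  # trailing data that does not follow the assumed layout
        # raw_decode parses one JSON value and reports where it ended.
        obj, pos = decoder.raw_decode(body, prefix.end())
        chunks.append(obj)
    return chunks


# Made-up body in the assumed layout; the real one is far larger.
sample_body = '16;{"a": [1, 2, 3]}10;{"b": "x"}'
for chunk in split_chunks(sample_body):
    print(sorted(chunk.keys()))
```

With the real response you would call `split_chunks(r.text)` and then inspect the resulting dictionaries to find where the table values live.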
