Web scrape coronavirus interactive plots


Question

I am trying to scrape data related to COVID-19. I was able to download some data from the website, for example the number of total cases, but not the data behind the interactive graphs.

I usually scrape interactive graphs via JSON by finding the source under the 'Network' tab of the browser's inspect-element tools. However, for these interactive graphs I was not able to find any 'Network' request to scrape.

Can someone please help me scrape the data from the "Total Deaths" graph, or any other graph on the website? Thanks.

Just to make it clear: I don't want to scrape data from the table of countries; I already did that. What I want is the data behind the graphs, for example the death-ratio-vs-date graph or the active-cases-vs-time graph.

import requests
import urllib.request
import time
import json
from bs4 import BeautifulSoup 
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

For example, the number of affected countries:

len(soup.find_all('table',{'id':'main_table_countries'})[0].find('tbody').find_all('tr'))
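For reference, the same kind of table rows can be parsed into per-country records. This is a minimal, self-contained sketch using a toy HTML snippet shaped like the question's table; the live page's markup, including the table id, may have changed since.

```python
from bs4 import BeautifulSoup

# Toy HTML shaped like the countries table from the question;
# the real page's markup may differ, so treat this as an illustration only.
html = """
<table id="main_table_countries">
  <thead><tr><th>Country</th><th>Total Cases</th><th>Total Deaths</th></tr></thead>
  <tbody>
    <tr><td>China</td><td>80,552</td><td>3,042</td></tr>
    <tr><td>Italy</td><td>3,858</td><td>148</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "main_table_countries"})

# Use the header cells as dict keys for each body row.
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.find("tbody").find_all("tr")
]
print(rows)
```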

The website: https://www.worldometers.info/coronavirus/#countries

Answer

As I mentioned in a comment, there are a few JavaScript blocks with Highcharts.chart(....), so I tried to get the values using different methods.

Most of them require finding the elements manually in the data and building the correct indexes or XPath to get the values, so it is not that easy.

The easiest was js2xml, which I saw in @AyushGarg's link: Can I scrape the raw data from highcharts.js?

The hardest was pyjsparser.

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import pyjsparser
import js2xml

url = 'https://www.worldometers.info/coronavirus/#countries'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

scripts = soup.find_all('script')

script = scripts[24].text  # index found manually; it changes when the page changes
print(script)

print('\n--- pyjsparser ---\n')
data = pyjsparser.parse(script) 
data = data['body'][0]['expression']['arguments'][1]['properties'][-2]['value']['elements'][0]['properties'][-1]['value']['elements'] # a lot of work to find it
#print(json.dumps(data, indent=2))
data = [x['value'] for x in data]
print(data)
# text values
# it needs work

print('\n--- eval ---\n')
data = script.split('data: [', 1)[1].split(']', 1)[0]
data = eval(data)  # it creates tuple
print(data)
# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

print('\n--- json ---\n')
data = script.split('data: [', 1)[1].split(']', 1)[0]
data = '[' + data + ']' # create correct JSON data
data = json.loads(data) # this time doesn't need `dirtyjson`
print(data)
# text values
data = script.split("title: { text: '", 1)[-1].split("'", 1)[0]
print(data)
data = script.split("title: { text: '", 3)[-1].split("'", 1)[0]
print(data)

print('\n--- js2xml ---\n')
data = js2xml.parse(script)
print(data.xpath('//property[@name="data"]//number/@value')) # nice and short xpath
# text values
#print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
text = data.xpath('//property[@name="title"]//string/text()')
print(text[0])
print(text[1])

Result:

 Highcharts.chart('coronavirus-deaths-linear', { chart: { type: 'line' }, title: { text: 'Total Deaths' }, subtitle: { text: '(Linear Scale)' }, xAxis: { categories: ["Jan 22","Jan 23","Jan 24","Jan 25","Jan 26","Jan 27","Jan 28","Jan 29","Jan 30","Jan 31","Feb 01","Feb 02","Feb 03","Feb 04","Feb 05","Feb 06","Feb 07","Feb 08","Feb 09","Feb 10","Feb 11","Feb 12","Feb 13","Feb 14","Feb 15","Feb 16","Feb 17","Feb 18","Feb 19","Feb 20","Feb 21","Feb 22","Feb 23","Feb 24","Feb 25","Feb 26","Feb 27","Feb 28","Feb 29","Mar 01","Mar 02","Mar 03","Mar 04","Mar 05"] }, yAxis: { title: { text: 'Total Coronavirus Deaths' } }, legend: { layout: 'vertical', align: 'right', verticalAlign: 'middle' }, credits: { enabled: false }, series: [{ name: 'Deaths', color: '#FF9900', lineWidth: 5, data: [17,25,41,56,80,106,132,170,213,259,304,362,426,492,565,638,724,813,910,1018,1115,1261,1383,1526,1669,1775,1873,2009,2126,2247,2360,2460,2618,2699,2763,2800,2858,2923,2977,3050,3117,3202,3285,3387] }], responsive: { rules: [{ condition: { maxWidth: 800 }, chartOptions: { legend: { layout: 'horizontal', align: 'center', verticalAlign: 'bottom' } } }] } }); 

--- pyjsparser ---

[17.0, 25.0, 41.0, 56.0, 80.0, 106.0, 132.0, 170.0, 213.0, 259.0, 304.0, 362.0, 426.0, 492.0, 565.0, 638.0, 724.0, 813.0, 910.0, 1018.0, 1115.0, 1261.0, 1383.0, 1526.0, 1669.0, 1775.0, 1873.0, 2009.0, 2126.0, 2247.0, 2360.0, 2460.0, 2618.0, 2699.0, 2763.0, 2800.0, 2858.0, 2923.0, 2977.0, 3050.0, 3117.0, 3202.0, 3285.0, 3387.0]

--- eval ---

(17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387)
Total Deaths
Total Coronavirus Deaths

--- json ---

[17, 25, 41, 56, 80, 106, 132, 170, 213, 259, 304, 362, 426, 492, 565, 638, 724, 813, 910, 1018, 1115, 1261, 1383, 1526, 1669, 1775, 1873, 2009, 2126, 2247, 2360, 2460, 2618, 2699, 2763, 2800, 2858, 2923, 2977, 3050, 3117, 3202, 3285, 3387]
Total Deaths
Total Coronavirus Deaths

--- js2xml ---

['17', '25', '41', '56', '80', '106', '132', '170', '213', '259', '304', '362', '426', '492', '565', '638', '724', '813', '910', '1018', '1115', '1261', '1383', '1526', '1669', '1775', '1873', '2009', '2126', '2247', '2360', '2460', '2618', '2699', '2763', '2800', '2858', '2923', '2977', '3050', '3117', '3202', '3285', '3387']
Total Deaths
Total Coronavirus Deaths
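The question imports pandas and matplotlib, so once the dates (the xAxis "categories" list) and the values (the series "data" list) are extracted by any of the methods above, they can be combined for plotting. A minimal sketch using the first few values from the output above:

```python
import pandas as pd

# Dates and cumulative values as extracted above (first few shown here);
# in practice they come from the chart's "categories" and "data" lists.
dates = ["Jan 22", "Jan 23", "Jan 24", "Jan 25", "Jan 26"]
deaths = [17, 25, 41, 56, 80]

df = pd.DataFrame({"date": dates, "total_deaths": deaths})

# Derive daily deaths from the cumulative series;
# the first row has no previous value, so keep the cumulative one.
df["daily_deaths"] = df["total_deaths"].diff().fillna(df["total_deaths"]).astype(int)
print(df)
```

From here `df.plot(x="date", y="total_deaths")` would reproduce the chart.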

---

The page changed its structure, so the code needed some changes too:

import requests
from bs4 import BeautifulSoup 
import json
#import dirtyjson
import js2xml
import pyjsparser

# --- functions ---

def test_eval(script):
    print('\n--- eval ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    values = eval(text)  # it creates tuple
    print(values)

    # title 
    # I split `yAxis` because there is other `title` without text
    # I split beginning in few steps because text may have different indentations (different number of spaces)
    # (you could use regex to split in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_json(script):
    print('\n--- json ---\n')

    # chart values
    text = script.split('data: [', 1)[1] # beginning
    text = text.split(']', 1)[0] # end
    text = '[' + text + ']' # create correct JSON data
    values = json.loads(text) # this time doesn't need `dirtyjson`
    print(values)

    # title
    # I split `yAxis` because there is other `title` without text
    # I split beginning in few steps because text may have different indentations (different number of spaces)
    # (you could use regex to split in one step)
    text = script.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

    text = script.split("yAxis: {\n", 1)[1] # beginning
    text = text.split("title: {\n", 1)[1] # beginning
    text = text.split("text: '", 1)[1] # beginning
    title = text.split("'", 1)[0] # end
    print('\ntitle:', title)

def test_js2xml(script):
    print('\n--- js2xml ---\n')

    data = js2xml.parse(script)

    # chart values (short and nice path)
    values = data.xpath('//property[@name="data"]//number/@value')
    #values = [int(x) for x in values] # it may need to convert to int() or float()
    #values = [float(x) for x in values] # it may need to convert to int() or float()
    print(values)

    # title (short and nice path)
    #print(js2xml.pretty_print(data.xpath('//property[@name="title"]')[0]))
    #title = data.xpath('//property[@name="title"]//string/text()')
    #print(js2xml.pretty_print(data.xpath('//property[@name="yAxis"]//property[@name="title"]')[0]))

    title = data.xpath('//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

    title = data.xpath('//property[@name="yAxis"]//property[@name="title"]//string/text()')
    title = title[0]
    print('\ntitle:', title)

def test_pyjsparser(script):
    print('\n--- pyjsparser ---\n')

    data = pyjsparser.parse(script)

    print("body's number:", len(data['body']))

    for number, body in enumerate(data['body']):
        if (body['type'] == 'ExpressionStatement'
            and body['expression']['callee']['object']['name'] == 'Highcharts'
            and len(body['expression']['arguments']) > 1):

            arguments = body['expression']['arguments']
            #print(json.dumps(values, indent=2))
            for properties in arguments[1]['properties']:
                #print('name: >{}<'.format(p['key']['name']))
                if properties['key']['name'] == 'series':
                    values = properties['value']['elements'][0]
                    values = values['properties'][-1]
                    values = values['value']['elements'] # a lot of work to find it
                    #print(json.dumps(values, indent=2))

                    values = [x['value'] for x in values]
                    print(values)

    # title (very complicated path) 
    # It needs more work to find correct indexes to get title
    # so I skip this part as too complex.

# --- main ---

url = 'https://www.worldometers.info/coronavirus/#countries'

r = requests.get(url)
#print(r.text)
soup = BeautifulSoup(r.text, "html.parser")

all_scripts = soup.find_all('script')
print('number of scripts:', len(all_scripts))

for number, script in enumerate(all_scripts):

    #if 'data: [' in script.text:
    if 'Highcharts.chart' in script.text:
        print('\n=== script:', number, '===\n')
        test_eval(script.text)
        test_json(script.text)
        test_js2xml(script.text)
        test_pyjsparser(script.text)
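The comments in `test_eval` and `test_json` note that regex could replace the multi-step `split()` calls. A sketch of that idea, run against a toy script snippet shaped like the Highcharts call shown in the results (the real script's spacing may differ, which is why the patterns use `\s*`):

```python
import re

# Toy snippet shaped like the Highcharts.chart(...) call from the results.
script = """
Highcharts.chart('coronavirus-deaths-linear', {
  title: { text: 'Total Deaths' },
  yAxis: {
    title: {
      text: 'Total Coronavirus Deaths'
    }
  },
  series: [{ name: 'Deaths', data: [17,25,41,56,80] }],
});
"""

# Anchor on `yAxis:` so the other `title` (the chart title) is not matched;
# `\s*` absorbs any indentation and newlines between the tokens.
y_title = re.search(r"yAxis:\s*{\s*title:\s*{\s*text:\s*'([^']*)'", script).group(1)

# Grab everything between `data: [` and the closing `]` in one step.
data = [int(x) for x in re.search(r"data:\s*\[([^\]]*)\]", script).group(1).split(",")]

print(y_title, data)
```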

I took it from my blog: Scraping: How to get data from interactive plot created with HighCharts
