Scraping data from interactive graph

Question

There is a website with a couple of interactive charts from which I would like to extract data. I've written a couple of web scrapers before in Python using Selenium WebDriver, but this seems to be a different problem. I've looked at a couple of similar questions on Stack Overflow. From those it seems that the solution could be to download the data directly from a JSON file. I looked at the source code of the website and identified a couple of JSON files, but upon inspection they don't seem to contain the data.

Does anyone know how to download the data from those graphs? In particular I am interested in this bar chart: .//*[@id='network_download']

Thanks

edit: I should add that when I inspected the website using Firebug I saw that it is possible to get data in the following format. But this is obviously not helpful, as it doesn't include any labels.

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">

Recommended answer

SVG charts like this tend to be a bit tough to scrape. The numbers you want aren't displayed until you actually hover the individual elements with your mouse.

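To get the data you want:

  1. Find the list of all dots
  2. For each dot in dots_list, click or hover (action chains) the dot
  3. Scrape the values from the tooltip that pops up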

This worked for me:

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        dots_css = "div#network_download g g.dots_container circle"
        # find_element(s)(By.CSS_SELECTOR, ...) replaces the older
        # find_element(s)_by_css_selector helpers, which newer Selenium
        # releases no longer provide
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this. For
            # that reason we need to re-query each dot element right before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tool tip so it clears; a fresh ActionChains
            # per iteration avoids replaying previously queued actions
            hover_elem = driver.find_element(By.ID, "network_download")
            ActionChains(driver).move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()

if __name__ == "__main__":
    main()

Sample output:

(.venv35) ➜  stackoverflow python svg_charts.py
Found 182 data points
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
<...>
[{'company': 'SingTel',
  'country': 'Singapore',
  'rank': '1',
  'speed': '40 Mbps'},
 {'company': 'StarHub',
  'country': 'Singapore',
  'rank': '2',
  'speed': '39 Mbps'},
 {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
...
]
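
The parsing step inside the loop can also be pulled out into a small standalone function, so it can be checked against the tooltip strings above without launching a browser. This is only a sketch: the name `parse_tooltip` and the `speed_mbps` key are hypothetical, and it assumes every tooltip has exactly the three-line layout shown in the sample output. Unlike the script above, it converts the rank and speed to numbers.

```python
def parse_tooltip(popup_text):
    """Split one tooltip string into rank, company, country and speed."""
    # Assumes exactly three lines: rank, "<company> in <country>", speed.
    rank_line, operator_line, speed_line = popup_text.split("\n")

    # Split on the last " in " so an operator name containing " in "
    # does not break the unpacking.
    company, country = operator_line.rsplit(" in ", 1)

    return {
        "rank": int(rank_line.split(" Globally")[0].lstrip("#")),
        "company": company,
        "country": country,
        # "40 Mbps" -> 40.0; float is used in case a speed is not a whole number.
        "speed_mbps": float(speed_line.split("Download speed: ")[1].split()[0]),
    }


sample = "#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps"
print(parse_tooltip(sample))
```

If a tooltip ever deviates from this layout, the tuple unpacking raises a `ValueError`, which is probably preferable to silently recording a malformed row.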

It should be noted that the values you referenced in the question, in the circle elements, aren't particularly useful, as those just specify how to draw the dots within the SVG chart.
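
To illustrate, the circle markup from the question can be parsed on its own. The sketch below uses only the standard library; the `<svg>` wrapper and the self-closing `/>` are assumptions added here so that the fragment is well-formed XML. The only fields it finds are the fill colour, pixel coordinates, radius, and opacity, with no operator name and no speed.

```python
import xml.etree.ElementTree as ET

# Two circles copied from the question; the <svg> root and the "/>" endings
# are added only so that the fragment parses as XML.
svg_fragment = """
<svg>
  <circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;"/>
  <circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;"/>
</svg>
""".strip()

root = ET.fromstring(svg_fragment)
circles = [dict(circle.attrib) for circle in root.iter("circle")]

for attrs in circles:
    # Every attribute is a drawing instruction; none carries a label.
    print(sorted(attrs))
```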
