Web scraping data from an interactive chart


Question

Would it be possible to get the data behind the interactive chart on this webpage (sorry, the website requires login)?

When I hover over the chart with the mouse, the data shows up, but how do I get that data?

Here's an extract of the HTML source code from that website:

<svg height="460" version="1.1" width="1037" xmlns="http://www.w3.org/2000/svg" style="overflow: hidden; position: relative; left: -0.5px;">
<desc>Created with Raphaël 2.1.0</desc>
<defs>

<path style="" fill="none" stroke="#f1f1f1" d="M20,130L1017,130M20,159.66666666666666L1017,159.66666666666666M20,189.33333333333331L1017,189.33333333333331M20,219L1017,219M20,248.66666666666666L1017,248.66666666666666M20,278.3333333333333L1017,278.3333333333333M20,308L1017,308">
<path style="" fill="none" stroke="#f1f1f1" d="M295.0344827586207,130L295.0344827586207,337.66666666666663M295.0344827586207,365L295.0344827586207,415M535.6896551724138,130L535.6896551724138,337.66666666666663M535.6896551724138,365L535.6896551724138,415M776.3448275862069,130L776.3448275862069,337.66666666666663M776.3448275862069,365L776.3448275862069,415M1017,130L1017,337.66666666666663M1017,365L1017,415">
<path style="" fill="none" stroke="#cccccc" d="M17,337.66666666666663L1018,337.66666666666663">
<path style="" fill="none" stroke="#cccccc" d="M17,365L1018,365">
<rect x="20" y="130" width="997" height="207.66666666666666" r="0" rx="0" ry="0" fill="#ff0000" stroke="none" style="opacity: 0;" opacity="0">
<path style="" fill="none" stroke="#6e87d7" d="M20,281.030303030303L54.37931034482759,316.6902356902357L88.75862068965517,318.78787878787875L123.13793103448276,318.78787878787875L157.51724137931035,318.78787878787875L191.89655172413794,312.4949494949495L226.27586206896552,285.2255892255892L260.65517241379314,312.4949494949495L295.0344827586207,314.59259259259255L329.41379310344826,316.6902356902357L363.7931034482759,297.8114478114478L398.1724137931035,318.78787878787875L432.55172413793105,335.56902356902356L466.9310344827586,293.61616161616155L501.3103448275862,276.8350168350168L535.6896551724138,272.6397306397306L570.0689655172414,274.7373737373737L604.448275862069,272.6397306397306L638.8275862068965,216.00336700336698L673.2068965517242,216.00336700336698L707.5862068965517,239.07744107744105L741.9655172413793,281.030303030303L776.344827586207,144.68350168350165L810.7241379310345,245.37037037037032L845.1034482758621,239.07744107744105L879.4827586206897,247.46801346801345L913.8620689655172,245.37037037037032L948.2413793103449,245.37037037037032L982.6206896551724,207.61279461279457L1017,163.56228956228955" stroke-width="2">
<path style="" fill="none" stroke="#f1f1f1" d="M20,390L1017,390M20,415L1017,415">
<path style="opacity: 

Recommended answer

You would have to parse that information (and, guessing from your tags, you'll want to do this in Python). However, having had a quick look at the Raphaël documentation, I'm fairly sure you can get the data in another, quicker way: the data has to exist as a JavaScript array somewhere. Try looking for that first.
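One way to look for such an array is to scan the page source (or any script file it loads) for bracketed runs of numbers. The sketch below is only an illustration: the variable name `chartData`, its values, and the helper `find_numeric_arrays` are invented here, and the real page may store its data in a different shape.

```python
import json
import re

# Hypothetical page source: the name `chartData` and its values are made up
# for illustration; the real page may store its data differently.
page_source = """
<script>
var chartData = [[1, 12.5], [2, 9.0], [3, 14.25]];
</script>
"""

def find_numeric_arrays(source):
    """Return every purely numeric array literal in `source` that parses as JSON."""
    arrays = []
    # Match bracketed runs made only of digits, signs, dots, commas, brackets and whitespace.
    for match in re.finditer(r"\[[\[\]\d\s.,eE+-]+\]", source):
        try:
            arrays.append(json.loads(match.group(0)))
        except ValueError:
            continue  # not valid JSON (e.g. unbalanced brackets); skip it
    return arrays

candidates = find_numeric_arrays(page_source)
print(candidates)
```

Arrays that contain strings, dates, or object literals would need a looser pattern or a proper JavaScript parser; this version only catches plain numeric arrays.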

Eventually, the SVG you've found gets generated from this JavaScript data. If you look at the SVG path element description, you'll see how those M (moveto) and L (lineto) commands need to be interpreted, and then you should be able to parse those lines into whatever (Python) dataset you like.
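In SVG path data, an uppercase M moves the pen to an absolute point and starts a new subpath, while an uppercase L draws a straight line to an absolute point. A minimal sketch of reading such a string, assuming it contains only absolute M and L commands as in the extract above (`parse_path` is a name chosen for this illustration):

```python
import re

def parse_path(d):
    """Split an SVG path string made of absolute M/L commands into (command, x, y) tuples."""
    points = []
    # Each token is a command letter followed by an "x,y" coordinate pair.
    for command, x, y in re.findall(r"([ML])\s*(-?[\d.]+)[\s,]+(-?[\d.]+)", d):
        points.append((command, float(x), float(y)))
    return points

# First three points of the blue (#6e87d7) data line from the question.
d = "M20,281.030303030303L54.37931034482759,316.6902356902357L88.75862068965517,318.78787878787875"
for command, x, y in parse_path(d):
    print(command, x, y)
```

Paths that use other commands (curves, relative lowercase commands, closepath) are outside what this sketch handles.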

However, I want to state again that it is hard for us to find what you are looking for without even a picture to go on (is it a histogram, is it a line chart?). The lines that are being drawn with L could be all you need.
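Keep in mind that the coordinates in a path are pixel positions, not data values, and that SVG y-coordinates grow downward. If two points can be matched to values shown in the hover tooltip, a linear mapping recovers the rest. The sketch below assumes a linear axis; the calibration numbers are invented for illustration and are not taken from the chart, and `make_pixel_to_value` is a hypothetical helper.

```python
def make_pixel_to_value(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel y-coordinate to a data value by linear interpolation."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical calibration: assume the tooltip shows 0 at y=337.0 and 100 at y=137.0.
to_value = make_pixel_to_value(337.0, 0.0, 137.0, 100.0)
print(to_value(237.0))
```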

As an example, if you take that first path you've listed into a Python session, you could do this:

import re

svg_string = "M20,130L1017,130M20,159.66666666666666L1017,159.66666666666666M20,189.33333333333331L1017,189.33333333333331M20,219L1017,219M20,248.66666666666666L1017,248.66666666666666M20,278.3333333333333L1017,278.3333333333333M20,308L1017,308"
# Split on the M (moveto) and L (lineto) commands, then turn each "x,y" pair into a tuple of floats.
data = [tuple(map(float, xy.split(','))) for xy in re.split('[ML]', svg_string)[1:]]



Note that this only works correctly because the Move and Line commands take turns in this string. But it does look like all the other paths are generated in a similar fashion (which leads me to think more strongly that the dataset is just somewhere in a JavaScript file you haven't looked at yet).
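If a path does not strictly alternate M and L (the blue data line, for instance, is one M followed by many L commands), one option is to keep the command letters and start a new polyline at every M. A minimal sketch, again assuming only absolute M and L commands; `path_to_polylines` is a name made up for this illustration:

```python
import re

def path_to_polylines(d):
    """Group the points of an absolute M/L path into polylines; each M starts a new one."""
    polylines = []
    for command, x, y in re.findall(r"([ML])\s*(-?[\d.]+)[\s,]+(-?[\d.]+)", d):
        if command == "M" or not polylines:
            polylines.append([])  # a moveto begins a new, separate line
        polylines[-1].append((float(x), float(y)))
    return polylines

# Shortened copies of two paths from the question: grid lines and the data line.
grid = "M20,130L1017,130M20,159.66666666666666L1017,159.66666666666666"
line = "M20,281.030303030303L54.37931034482759,316.6902356902357L88.75862068965517,318.78787878787875"
print(len(path_to_polylines(grid)), len(path_to_polylines(line)))
```

With this grouping, the grid path yields several two-point segments, while the data line comes back as a single polyline whose points follow the chart from left to right.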

Finally, to obtain this source code, you should look into using urllib2 (urllib.request in Python 3) for programmatic retrieval.
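A rough Python 3 sketch of that retrieval step follows. The URL and cookie in the usage comment are placeholders, and passing a session cookie is only one assumed way of getting past the login; the site may need something else. Also note that Raphaël builds the SVG in the browser, so the raw response may contain the scripts and their data rather than the `<svg>` markup itself.

```python
import urllib.request

def fetch_page(url, cookie=None):
    """Download `url` and return its body as text, optionally sending a session cookie."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    if cookie:
        request.add_header("Cookie", cookie)
    with urllib.request.urlopen(request, timeout=30) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Hypothetical usage: both values below are placeholders, not real ones.
# html = fetch_page("https://example.com/chart-page", cookie="sessionid=...")
```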
