Getting data from a chart that is displayed on a website


Question

I was asked to draw a graph like this one using LaTeX (more precisely, tikz and/or pgf). If I had the data this would not be a problem, but I don't. All I have is the website from where graphs can be displayed, but I don't know how to get the data from there.

I spent the day today trying to get this data, including writing to Google and using a type of software which traces the line and infers the points of a graph, such as Datathief and DigitizeIt, but I was unsuccessful. I think the latter did not work because the lines in the graph are too thin and have more than one shade of blue. Of course, I tried to improve the picture quality using Paint and Gimp but I still couldn't make it work.

I also tried using eps2pgf, a Java program which transforms eps figures into pgf code, but even that did not work for the graphs I saved using Image Capture (Mac) and Print Screen (Windows). To be honest, this would be my last option anyway, since it is a "brute force approach" that spits out ugly code you can't really improve on.

After all that I decided to start learning Python, because my supervisor, the person who asked me to draw this picture using tikz, said that there is a Python code to get data from websites like this. Now I am not even sure Python will do the job (though I am happy for the excuse to learn it) and of course it takes time to learn a new language and do something like that, so I want to know whether there is really a way to get the data from that website, using preferably Python but if not, any other method.

Solution

Well, it'd be great if Google provided an API for this data! That said, you can still scrape some data out of the site. Here's how to go about it...

Install Firebug

I prefer Firebug for Firefox, but Chrome's developer tools should also work.

Investigate

First things first, let's visit the URL in question and use Firebug to try and see what's going on. Activate Firebug with F12 or go to Tools->Firebug->Open Firebug. Click on the Net tab first and reload the page. This shows all the requests made, and will give you some insight into how the site works. Usually Flash plugins load data externally, as opposed to having it embedded in the actual plugin, and if you look at the requests you'll see a request labeled POST service. If you hover over it, Firebug shows the full URL and you'll see the page made a request to http://www.google.com/transparencyreport/traffic/service. You can click on the request and look at the headers sent, the post data, the response and the cookies used to perform the request.

If you look at the response, you'll see what appears to be malformed JSON. From what I can tell this appears to contain the list of normalized traffic data points. You could actually cut and paste the response out of firebug, but since this IS a python question, let's work a bit harder.

Getting the data into Python

To make the post request successfully, we'll need to do (nearly) everything the browser does. We can cheat a bit and just copy the request headers and post data out of firebug, to spoof a real request.

Headers & post data

Use triple quotes to paste multi-line strings into the shell. Copy the request headers and paste them in.

>>> headers = """ <paste headers> """

Next, convert it to a dict for httplib2. I'm going to use a list comprehension (which splits the string on newlines, then splits each line on the first : and strips surrounding whitespace, which gives me a list of two-element lists that dict can convert into a dictionary), but you could do this however you want. You could manually create the dict too; I just find this faster.

>>> headers = dict([[s.strip() for s in line.split(':', 1)]
                               for line in headers.strip().split('\n')])
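
As a self-contained illustration of that conversion, here is a small sketch. The header text is a made-up sample (not the headers of the real request), and parse_headers is a hypothetical helper name, not something from httplib2.

```python
def parse_headers(raw):
    """Turn a pasted block of 'Name: value' lines into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        if ':' not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values containing ':' survive
        name, value = line.split(':', 1)
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample headers, for illustration only
sample = """
Host: www.google.com
Content-Type: text/plain; charset=utf-8
Referer: http://www.google.com/transparencyreport/traffic/
"""

parsed = parse_headers(sample)
print(parsed['Referer'])
```

Splitting on the first colon only matters for values such as URLs, which contain colons of their own.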

And copy in the post data.

>>> body = """ <paste post data> """

Make the request

I'm going to use httplib2, but there are a few other HTTP clients and some nice tools for scraping the web, like mechanize and scrapy. We'll make the POST request using the URL of the API, the headers we copied and the post data we copied from Firebug. The request returns a tuple of response headers and content.

>>> import httplib2 
>>> h = httplib2.Http()
>>> url = 'http://www.google.com/transparencyreport/traffic/service'
>>> resp, content = h.request(url, 'POST', body=body, headers=headers)
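
If httplib2 is not available, roughly the same request can probably be built with Python 3's standard library. This is only a sketch that swaps in urllib.request for httplib2; build_post_request is a hypothetical helper name, and the header and body values are placeholders for whatever you copied from the browser.

```python
import urllib.request


def build_post_request(url, headers, body):
    """Build (but do not send) a POST request with copied headers and body."""
    return urllib.request.Request(
        url,
        data=body.encode('utf-8'),  # urllib expects the POST body as bytes
        headers=headers,
        method='POST',
    )


# Placeholder values; paste the real headers and post data from the browser
url = 'http://www.google.com/transparencyreport/traffic/service'
headers = {'Content-Type': 'text/plain; charset=utf-8'}
body = '<paste post data>'

req = build_post_request(url, headers, body)
print(req.get_method(), req.full_url)

# Sending it would look like this (left commented out: it needs network
# access, and the service may no longer respond at this URL):
# with urllib.request.urlopen(req) as resp:
#     content = resp.read().decode('utf-8')
```

Building the request separately from sending it makes it easy to check the method, URL and body before anything goes over the network.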

Massage Data

The original format is really weird and only the top bit seems to contain the data points, so I'll ditch the rest.

>>> cleaned = content.split("'")[0][4:-1] + ']' 

Now it's valid JSON, so we can deserialize it into native Python data types.

>>> import json
>>> data = json.loads(cleaned)

All of the points I'm interested in are floats, so I'll filter based on that.

>>> data = [x for x in data if isinstance(x, float)]
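
To make the clean-up steps concrete, here is a sketch that runs them on a made-up response string. The sample is an assumption: it is only shaped so that the slicing applies (a 4-character prefix, a run of numbers, then the first quoted string), and the real response may well look different. Also note that under Python 3 httplib2 returns the content as bytes, so it would need content.decode('utf-8') before this step.

```python
import json

# Made-up response text, shaped so that the slicing used above applies:
# a 4-character prefix, numbers, then the first quoted string.
content = "//OK[44.7,45.4,3,'abc','def'],0,7]"

# Keep everything before the first single quote, drop the 4-character
# prefix and the trailing comma, then close the list.
cleaned = content.split("'")[0][4:-1] + ']'

values = json.loads(cleaned)

# Integers in the list are dropped; only floats are kept.
points = [x for x in values if isinstance(x, float)]
print(points)
```

Filtering on float relies on the non-data entries being integers or strings; if the real response mixes in other floats, a stricter rule would be needed.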

Process/Save Data

Now that we have our data, inspect it, do additional processing, etc...

>>> data[:5] 
<<< 
[44.73874282836914,
 45.4061279296875,
 47.5350456237793,
 44.56114196777344,
 46.08817672729492]

...or just save it.

>>> with open('data.json', 'w') as f:
...:     json.dump(data, f)

We could also plot it out using pyplot from matplotlib (or some other graphing/plotting library).

>>> import matplotlib.pyplot as plt
>>> plt.plot(data)
>>> plt.show()  # needed outside interactive plotting mode
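
Since the original goal was a tikz/pgf plot, the points could also be written out as a plain whitespace-separated table. The sketch below is one way to do that; to_pgf_table and the file name data.dat are hypothetical, and using the list index as the x value is an assumption, because the real x-axis (time) is not recovered by the steps above.

```python
def to_pgf_table(points):
    """Format y-values as a two-column 'x y' table, x being the index."""
    lines = ['x y']  # header row naming the columns
    for i, y in enumerate(points):
        lines.append('{} {}'.format(i, y))
    return '\n'.join(lines) + '\n'


# First few values from the output shown earlier
data = [44.73874282836914, 45.4061279296875, 47.5350456237793]

table = to_pgf_table(data)
with open('data.dat', 'w') as f:
    f.write(table)
print(table)
```

A file in this shape should be readable from pgfplots with something like \addplot table {data.dat}; inside an axis environment.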

Conclusion

If you are just interested in a few things, you can adjust the chart to display what you want and then reuse the request headers/post data from the corresponding request to http://www.google.com/transparencyreport/traffic/service. You might want to inspect the actual response more closely than I did; I just discarded the parts that didn't make sense to me. Hopefully they'll expose a public API for this data.
