Python 3, Web-scraping, and Javascript [Oh My]


Question

I have come to the point of entering the melee on web-scraping webpages using Javascript, with Python3. I am well aware that my boot may be making contact with a dead horse, but I feel like drawing my six-shooter anyway. It's a spaghetti western; be my gray hat?

:: Backstory ::

I am using Python 3.2.3.

I am interested in gathering historical stock//etf//mutual_fund price data for YTD, 1-yr, 3-yr, 5-yr, 10-yr... and/or similar timeframes for a user-defined stock, etf, or mutual fund. I set my sights on Morningstar.com, as they tend to provide as much data as possible without necessarily requiring a log-in; other folks such as finance.google.com &c tend to be inconsistent in what data they provide regarding stocks vs etfs vs mutual funds.

The trade-off in using Morningstar for this historical data, or "Trailing Total Returns" as they call it, is that for producing this data they use Javascript.

Here are some example links from Morningstar:

A Mutual Fund;

An ETF;

A Stock.

I am interested in the "Trailing Returns" portion, top row or so of numbers in the Javascript-produced chart.

:: Attempts so far ::

I've confirmed that wget doesn't play with Javascript; even downloading all of the associated files [css, .js, &c] hasn't allowed me to locally render the javascript in browser or in script. Research here on StackOverflow confirmed this. Am willing to be corrected here.

My research informed me that Mechanize doesn't exist for Python3. I tried anyway, and turned into Policeman Javert crying out "I knew it!" at the error message "module does not exist".

:: I've heard of... ::

->Selenium. However, my understanding is that this requires Thy Favorite Browser to actually open up a webpage, navigate around, and then not close because there's no "close this tab//window" command//option for Selenium. What if I//my_user want to get historical data for many etfs, stocks, and/or mutual funds? That's a lot of tabs//windows opening up in a browser which was not necessarily desired to be opened.

->httplib2. I think this is nice, but I'm doubtful if it will play with Javascript. Does it, for example using the .cache and get options?

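import httplib2

# Cache responses on disk in the ".cache" directory.
conn = httplib2.Http(".cache")
# request() returns a (response headers, body bytes) tuple.
response, content = conn.request("http://the_url", "GET")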

->Windmill. See 'Selenium'. I am, however, off-key enough to sing 'Man of La Mancha'.

->Google's webscraping code. Would an attempt at downloading a Javascript-laden page result in ... positive results?

I've read chatter about having to "emulate a browser without a browser". Sounds like Mechanize, but not for Python3 as I currently understand.

:: My Question ::

Any suggestions, pointers, solutions, or "look over here" directions?

Many thanks,

Miles, Dusty Desert Villager.

Recommended answer

When a page loads data via javascript, it has to make requests to the server to get that data via the XMLHttpRequest function (XHR). You can see what requests they are making, and then make them yourself, using wget!

To find out which requests they are making, use the Web Inspector (Chrome and Safari) or Firebug (Firefox). Here's how to do it in Chrome:

wrench/tools/developer tools/Network (tab at the top of the tools)/XHR filter at the bottom.

Here's an example request they make in javascript

If you look closely at the XHR request url, you notice that all trailing returns have the same format:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=

You just need to specify t. For example:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VAW
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=INTC
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VHCOX
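
Since only the t parameter changes, the URLs above can be generated rather than typed by hand. A minimal Python 3 sketch, assuming the endpoint keeps the format shown; the names BASE_URL and trailing_returns_url are made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint copied from the XHR request URL shown above.
BASE_URL = (
    "http://performance.morningstar.com"
    "/Performance/cef/trailing-total-returns.action"
)


def trailing_returns_url(ticker):
    """Return the trailing-total-returns URL for one ticker symbol.

    Hypothetical helper: it only fills in the "t" query parameter.
    """
    return BASE_URL + "?" + urlencode({"t": ticker})


# Print one URL per ticker from the examples above.
for symbol in ("VAW", "INTC", "VHCOX"):
    print(trailing_returns_url(symbol))
```

Using urlencode rather than plain string concatenation means a ticker containing unusual characters still ends up properly escaped.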

Now you can wget those URIs and parse out the data directly.
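
If you would rather stay inside Python 3 than shell out to wget, the standard library can do both the fetch and the parse. The sketch below swaps in urllib.request for wget and html.parser for the parsing step. It assumes the response is an HTML fragment whose numbers sit in ordinary table rows, which is a guess you should check against the real response; TableCellParser, parse_rows and fetch_rows are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url):
    """Download url and return its table rows (assumes UTF-8 HTML)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_rows(html)


# Example (needs network access, and the endpoint may have changed):
# rows = fetch_rows("http://performance.morningstar.com/Performance/cef/"
#                   "trailing-total-returns.action?t=VAW")
```

Keeping parse_rows separate from fetch_rows lets you save one response to disk and work out which row holds the trailing returns without hitting the server again.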
