Downloading Morningstar webpages for screenscraping

Question

I'd like to be able to screenscrape Morningstar webpages. Morningstar provides information about a mutual fund that I routinely look up but haven't been able to find elsewhere, namely:


  1. Total returns compared to a benchmark

  2. Total returns compared to peers

  3. Percentile rank

Here's an example: Morningstar example

As a prelude to screenscraping, I need to be able to download the webpage with the desired content. Unfortunately, when I use Java SE6 or wget to retrieve the example link above, I only get part of the HTML: the tables displaying the total return figures are absent. I get the same result if I use my browser (Chrome) to save the page as HTML only. However, if I use the browser to save the complete page (HTML, JS, CSS, and everything else), the downloaded HTML does contain the information I'm after.
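For reference, the plain retrieval described above amounts to something like the following sketch. It uses only the standard library, returns exactly the bytes the server sends, and executes no JavaScript, which would explain the missing tables if they are filled in by scripts after the page loads. The class name `PlainFetch`, the UTF-8 charset, the `User-Agent` value, and the marker argument are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

public class PlainFetch {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Fetches the HTML as sent by the server. No scripts are executed,
    // so anything a script would add to the page later is not included.
    static String fetchRawHtml(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // assumed header value
        InputStream in = conn.getInputStream();
        try {
            return readAll(in, Charset.forName("UTF-8")); // assumes a UTF-8 page
        } finally {
            in.close();
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: page URL; args[1]: text you expect in the rendered page (hypothetical marker)
        String html = fetchRawHtml(args[0]);
        System.out.println("length: " + html.length());
        System.out.println("contains marker: " + html.contains(args[1]));
    }
}
```

Running it with the page URL and a string that appears in the browser-rendered table is a quick way to check whether the data is present in the server's response at all.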

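I have two questions:


  1. How can I programmatically download the complete HTML file? Although I'm writing the program in Java, I don't mind calling an external tool.

  2. Why don't the attempts I mentioned above yield the HTML I expect?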

Thanks.

As a side note, I looked at Yahoo Finance and YQL/datatables as alternatives, but Yahoo Finance doesn't provide percentile rankings. If you look up the performance of a mutual fund, you'll see N/A values for the rankings (Yahoo Finance example). Unfortunately, this rules out using YQL/datatables.

Regarding any questions about Morningstar's copyright: I'm screenscraping for personal, non-commercial use, which their copyright notice allows in the last sentence of its second paragraph:


You are entitled to use the information contained herein for private,
non-commercial use only. Copyright Morningstar


Recommended answer

To download the Morningstar webpage, I needed a tool that would download and interpret the JavaScript code associated with the page. Many such tools, for different programming languages and browsers, are mentioned on StackOverflow. Here are the ones I wound up using:

  • HtmlUnit - a GUI-less browser for Java programs
  • htmlunitscripter - a Firefox add-on that autogenerates HtmlUnit code
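As a rough illustration of the first tool, a minimal HtmlUnit sketch might look like the following. It is not the answerer's actual code. It assumes HtmlUnit 3.x or later on the classpath (the `org.htmlunit` package; 2.x releases use `com.gargoylesoftware.htmlunit` instead), and the class name `RenderedFetch`, the choice of `BrowserVersion.CHROME`, and the 10-second wait are illustrative assumptions. Whether the wait is long enough for a given page has to be checked by experiment.

```java
import java.io.IOException;

import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class RenderedFetch {

    // Builds a headless client with JavaScript turned on.
    static WebClient newClient() {
        WebClient client = new WebClient(BrowserVersion.CHROME);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setCssEnabled(false); // styling is not needed for scraping
        client.getOptions().setThrowExceptionOnScriptError(false); // tolerate script errors on the page
        return client;
    }

    // Loads the page, gives background scripts time to run,
    // and returns the resulting DOM serialized as XML.
    static String fetchRenderedHtml(String pageUrl, long waitMillis) throws IOException {
        WebClient client = newClient();
        try {
            HtmlPage page = client.getPage(pageUrl);
            client.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        } finally {
            client.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: page URL (supplied by the caller)
        System.out.println(fetchRenderedHtml(args[0], 10000));
    }
}
```

The returned markup reflects the DOM after scripts have run, so it can be handed to whatever parsing step follows. If the tables are still missing, a longer wait or a different `BrowserVersion` may be worth trying.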
