Web scraping a problem site


Question

I'm trying to scrape some information from a web site, but I'm having trouble reading the relevant pages. The pages seem to send a basic setup first and more detailed info afterwards. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.

Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view the page source.

An example URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT

I'd like, for example, the average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page; it's downloading the page so that I can extract the info.

Recommended answer

The website loads the data via ajax, and Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
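Requesting that URL directly, instead of the snapshot page, might look like the minimal sketch below. It assumes Python 3's urllib and that the endpoint still answers a plain GET without cookies or a login. The helper names `build_request` and `fetch_fragment` and the User-Agent value are illustrative assumptions, not something the site is known to require.

```python
import urllib.request

# URL quoted in the answer above; the endpoint may have changed since it was written.
CONTENT_URL = (
    "https://personal.vanguard.com/us/JSP/Funds/VGITab/"
    "VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542"
)


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request. Sending a browser-like User-Agent is an assumption."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_fragment(url, timeout=30):
    """Download the HTML fragment that the browser would load via ajax."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a network request):
# fragment = fetch_fragment(CONTENT_URL)
# print(fragment[:500])
```

The returned fragment can then go through whatever extraction code was already planned for the full page.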

See the corresponding JavaScript code on the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
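Since the ajax path appears in the page source as `populatorUrl`, it could also be discovered programmatically rather than copied from Firebug. The following is a rough sketch that assumes the value is always a double-quoted string, as in the snippet above. The function name `find_populator_urls` and the regular expression are hypothetical helpers, not part of the site's code.

```python
import re
from urllib.parse import urljoin

# Matches populatorUrl:"..." in the inline script; assumes a double-quoted value.
POPULATOR_URL_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def find_populator_urls(page_html, page_url):
    """Return an absolute URL for every populatorUrl found in the page source."""
    return [urljoin(page_url, path) for path in POPULATOR_URL_RE.findall(page_html)]
```

Each returned URL can then be downloaded with urllib in the same way as any other page.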
