Web scraping a problem site
Question
I'm trying to scrape some information from a website, but I'm having trouble reading the relevant pages. The pages seem to send a basic setup first, then the more detailed info. My download attempts seem to capture only the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome display the pages without trouble, although the parts I want don't appear when I view the page source.
An example URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
For example, I'd like the average maturity and average duration shown at the lower right of the page. The problem isn't extracting that info from the page; it's downloading the page so that I can extract it.
Recommended answer
The website loads the data via Ajax, and Firebug shows the Ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding JavaScript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
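One way to act on this is to read the `populatorUrl` value out of the page source and request that URL directly. The following is a minimal Python 3 sketch, assuming the page still embeds `populatorUrl` in the form shown above and that the content URL can be fetched without a browser session or cookies; neither assumption has been verified. The names `find_populator_urls` and `fetch` are hypothetical helpers, not part of any library.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Example page from the question.
PAGE_URL = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"

# Matches the populatorUrl:"..." field inside the Populator({...}) call.
POPULATOR_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def find_populator_urls(page_html, base_url=PAGE_URL):
    """Return an absolute URL for every populatorUrl found in the page source."""
    return [urljoin(base_url, path) for path in POPULATOR_RE.findall(page_html)]


def fetch(url, timeout=30):
    """Download a URL and return its body as text (hypothetical helper)."""
    # Assumption: a browser-like User-Agent is enough; the site may require more.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `find_populator_urls(fetch(PAGE_URL))` should then return the Ajax content URLs, and passing each one to `fetch` should return the HTML fragment that holds the detailed fund data, ready for whatever parser you already use.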