Web scraping a problem site

Views: 35 | Published: 2021/7/17 18:41:39 | Tags: python screen-scraping

Question

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.

Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.

An example URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT

I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

Recommended answer

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542

See the corresponding JavaScript code in the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
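
Following the inline script above, one way to get at the detailed data is to download the outer page, pull every populatorUrl out of it, and then request those URLs directly. The sketch below uses only the standard library. The helper names (find_populator_urls, fetch, download_fragments), the regular expression, and the User-Agent header are illustrative assumptions, not anything the site documents; the regex assumes the script keeps the populatorUrl:"..." shape shown above, and the site may have changed since the answer was written.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Example page from the question.
PAGE_URL = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"

# Assumption: each tab is declared as populatorUrl:"<path>" in an inline script.
POPULATOR_URL_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def find_populator_urls(html, base_url):
    """Return absolute URLs for every populatorUrl found in the page HTML."""
    return [urljoin(base_url, path) for path in POPULATOR_URL_RE.findall(html)]


def fetch(url, timeout=30):
    """Download a URL and return its body as text (hypothetical helper)."""
    # Assumption: a browser-like User-Agent; the site may not require it.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download_fragments(page_url):
    """Fetch the outer page, then each Ajax fragment it refers to.

    Returns a dict mapping fragment URL to the downloaded fragment text.
    """
    shell = fetch(page_url)
    return {url: fetch(url) for url in find_populator_urls(shell, page_url)}
```

Calling download_fragments(PAGE_URL) would return the HTML fragments that the browser normally loads after the initial page; values such as average maturity and average duration can then be extracted from those fragments with whatever parser is already in use.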
