Scrape an entire scroll-to-load page with Python Requests


Question

Specifically, I'm trying to scrape this entire page, but I'm only getting a portion of it:

http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120

If I use:

    import requests

    r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')

it only gets the "visible" part of the page, since more items load as you scroll down.

I know there are some solutions using PyQt, such as this one:

Repeatedly scrolling to the bottom of a page with PyQt QWebKit

But is there a way to have Python requests keep scrolling to the bottom of a web page until all items have loaded?

Recommended answer

You could monitor the page's network activity with the browser developer console (F12, then the Network tab in Chrome) to see which requests the page makes when you scroll down, and then reproduce those requests with requests. Alternatively, you can use Selenium to control a browser programmatically, scroll down until the page stops loading new items, and then save its HTML.
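The Selenium alternative can be sketched roughly as follows. This is a minimal sketch, not a tested scraper for this site: the helper names `scroll_to_bottom` and `fetch_scrolled_html`, the one-second pause, and the choice of Chrome are all assumptions made for illustration. It assumes the page is finished once the document height stops growing after a scroll, which may not hold for every page.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=100):
    """Scroll down until the document height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls that caused the page to grow.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(max_rounds):
        # Jump to the current bottom so the page requests its next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Assumption: a fixed pause is long enough for new items to load.
        time.sleep(pause_seconds)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new appeared, so stop
        last_height = new_height
    return max_rounds


def fetch_scrolled_html(url):
    """Hypothetical helper: open `url` in Chrome, scroll, return the HTML."""
    # Imported here so scroll_to_bottom can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_scrolled_html` with the page URL from the question would return the fully scrolled HTML, which can then be parsed like any other document. A fixed pause is the simplest option; if the page loads slowly, a longer pause or an explicit wait may be needed.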

I think I found the right request:

Request URL: http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method: GET
Status Code: 200 OK
Remote Address: 87.245.221.98:80

Request Headers

Provisional headers are shown
Accept: application/json, text/javascript, */*; q=0.01
Referer: http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID: VQYGVF5SCBAJVlFaAQIH
X-Requested-With: XMLHttpRequest

It seems that the query parameter pn identifies the current "subpage". You still need to work out how to interpret the response.
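Replaying that request for successive values of pn might look like the sketch below. The URL, query parameters, and headers are taken from the captured request above; everything else is an assumption: the helper names (`build_page_url`, `fetch_json`, `iter_pages`) are hypothetical, numbering is assumed to start at pn=1, and the stop rule (an empty response means no more pages) is only a placeholder, since the actual response format still has to be inspected.

```python
from urllib.parse import urlencode

BASE_URL = "http://store.nike.com/html-services/gridwallData"
REFERER = "http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120"


def build_page_url(page_number):
    """Rebuild the captured request URL for a given pn value."""
    params = {
        "country": "US",
        "lang_locale": "en_US",
        "gridwallPath": "mens-shoes/7puZoi3",
        "pn": page_number,
    }
    # safe="/" keeps the slash in gridwallPath unescaped, as in the capture.
    return BASE_URL + "?" + urlencode(params, safe="/")


def fetch_json(url):
    """Send the request with the headers seen in the browser."""
    # Imported here so the URL helper works without requests installed.
    import requests

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": REFERER,
        "X-Requested-With": "XMLHttpRequest",
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()


def iter_pages(fetch=fetch_json, max_pages=50):
    """Yield one decoded response per subpage.

    Assumptions: pn starts at 1, and an empty response marks the end.
    """
    for page_number in range(1, max_pages + 1):
        data = fetch(build_page_url(page_number))
        if not data:
            break
        yield data
```

Iterating with `for page in iter_pages(): ...` would then give one decoded response per subpage. The `fetch` parameter is there so the loop can be tried against saved responses before sending real requests, and `max_pages` keeps it from running forever if the stop rule turns out to be wrong.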
