Scrape an entire scroll-loaded page with Python Requests
Question
Specifically, I'm trying to scrape this entire page, but I'm only getting a portion of it:
http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
If I use:
r = requests.get('http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120')
it only gets the "visible" part of the page, since more items load as you scroll down.
I know there are some solutions in PyQt, such as this:
But is there a way to have Python Requests keep scrolling to the bottom of a webpage until all items have loaded?
Recommended answer
You can monitor the page's network activity in the browser developer console (F12, then the Network tab in Chrome) to see which requests the page makes as you scroll down, then reproduce those requests with requests. Note that requests only fetches URLs and does not execute JavaScript, so it cannot "scroll" by itself. As an alternative, you can use selenium to control a browser programmatically, scroll down until the page ends, and then save its HTML.
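The selenium route can be sketched roughly as follows. This is a minimal sketch, not a tested scraper for this site: it assumes the page grows document.body.scrollHeight whenever new items load, and the helper names (scroll_to_bottom, save_full_page), the pause length, and the round limit are all hypothetical choices for illustration.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=100):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that caused the page to grow.
    Assumption: newly loaded items increase document.body.scrollHeight.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(max_rounds):
        # Jump to the current bottom of the page to trigger the next load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new appeared, so assume we hit the end
        last_height = new_height
    return max_rounds


def save_full_page(url, path):
    """Open `url` in Chrome, scroll to the end, and write the HTML to `path`."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        with open(path, "w", encoding="utf-8") as f:
            f.write(driver.page_source)
    finally:
        driver.quit()
```

A fixed pause is the simplest option but also the most fragile one: if the site loads slowly, the height check can fire before new items arrive, so the pause may need tuning.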
I guess I found the right request:
Request URL:http://store.nike.com/html-services/gridwallData?country=US&lang_locale=en_US&gridwallPath=mens-shoes/7puZoi3&pn=3
Request Method:GET
Status Code:200 OK
Remote Address:87.245.221.98:80
Request Headers
Provisional headers are shown
Accept:application/json, text/javascript, */*; q=0.01
Referer:http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
X-NewRelic-ID:VQYGVF5SCBAJVlFaAQIH
X-Requested-With:XMLHttpRequest
The query parameter pn appears to indicate the current "subpage". You still need to work out how to interpret the response correctly.
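Reproducing that request with requests might look like the sketch below. The URL, query parameters, and headers come from the captured request above; everything else is an assumption: that pn starts at 1, that the endpoint returns JSON, and that a non-200 or empty response marks the end of the listing. The function names are hypothetical, and the structure of the returned data is not covered here.

```python
from urllib.parse import urlencode

BASE_URL = "http://store.nike.com/html-services/gridwallData"

# Headers copied from the captured request.
HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Referer": "http://store.nike.com/us/en_us/pw/mens-shoes/7puZoi3?ipp=120",
    "X-Requested-With": "XMLHttpRequest",
}


def build_url(pn):
    """Return the gridwallData URL for subpage number `pn`."""
    params = {
        "country": "US",
        "lang_locale": "en_US",
        "gridwallPath": "mens-shoes/7puZoi3",
        "pn": pn,
    }
    # safe="/" keeps the slash in gridwallPath unescaped, as in the capture.
    return BASE_URL + "?" + urlencode(params, safe="/")


def fetch_page(pn, timeout=10):
    """Fetch one subpage; return the parsed JSON, or None on failure."""
    import requests  # third-party: pip install requests

    response = requests.get(build_url(pn), headers=HEADERS, timeout=timeout)
    if response.status_code != 200:
        return None
    try:
        return response.json()
    except ValueError:  # body was not valid JSON
        return None


def iter_pages(fetch=fetch_page, max_pages=50):
    """Yield (pn, data) for pn = 1, 2, ... until a fetch returns nothing.

    Assumption: pages start at 1 and an empty or failed response means
    there are no more subpages. max_pages guards against endless loops.
    """
    for pn in range(1, max_pages + 1):
        data = fetch(pn)
        if not data:
            break
        yield pn, data
```

Calling `for pn, data in iter_pages(): ...` would then walk the subpages one by one; inspecting the first response by hand is the quickest way to find out which keys hold the product entries.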