How to scrape data off a part of a site that requires user navigation


Question

For example, say I am trying to scrape this page:

http://www.swtor.com/leaderboards/pvp/solo

It only shows the top 50 results, and it is easy enough to request this link and scrape that data. But say I want the top 200. As a user I can click "next page" and see the next 50 results, but doing so does not produce a new URL. The whole table is controlled by JavaScript rather than by explicit links that I can follow.

In a situation like this, how can I use code to navigate to the second page and beyond, so that I can scrape the next sets of records?

Recommended answer

If you open the "Network" panel in your browser's developer tools, you can see the XMLHttpRequest (XHR) requests the site makes to load the table data:

http://www.swtor.com/lb/data?page=1&column=pvp_ranked_solo&season=6
http://www.swtor.com/lb/data?page=2&column=pvp_ranked_solo&season=6

This endpoint returns convenient JSON, so it is just a matter of sending as many requests as you need, changing the page parameter each time. Tip: the number of pages is also included in the returned JSON, so you never have to parse an HTML page, even if you want all the records.
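As a rough illustration, the requests above could be scripted along these lines using only the Python standard library. This is a minimal sketch, not a tested scraper for this site: the helper names (build_url, fetch_json, scrape_pages) are made up for the example, and the layout of the returned JSON is not assumed, so the code simply returns each decoded payload as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint observed in the browser's Network panel (taken from the answer).
BASE_URL = "http://www.swtor.com/lb/data"


def build_url(page, column="pvp_ranked_solo", season=6):
    """Return the XHR URL for one page of the leaderboard."""
    query = urlencode({"page": page, "column": column, "season": season})
    return f"{BASE_URL}?{query}"


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape_pages(last_page, fetch=fetch_json):
    """Fetch pages 1..last_page and return the decoded payloads in order.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    return [fetch(build_url(page)) for page in range(1, last_page + 1)]
```

With 50 results per page, scrape_pages(4) would request the pages covering the top 200. To collect everything, one option is to fetch page 1 first, inspect the payload to find the field that holds the page count, and then pass that number as last_page.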
