scrapy can't crawl all links in a page


Question

I am trying to use Scrapy to crawl an AJAX website: http://play.google.com/store/apps/category/GAME/collection/topselling_new_free

I want to get all the links pointing to each game.

I inspected the elements of the page (screenshot: how the page looks), so I want to extract all links matching the pattern /store/apps/details?id=

But when I ran commands in the shell (screenshot: shell command), it returned nothing.

I've also tried //a/@href; that didn't work either, and I don't know what's going wrong.
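For reference, the extraction the question is after can be sketched with the standard library alone; `AppLinkParser` is an illustrative name, and the tiny HTML snippet is made up. As the answer explains, these links are injected by an AJAX POST request, so running a selector like this against the static page source finds nothing.

```python
from html.parser import HTMLParser

class AppLinkParser(HTMLParser):
    """Collect every <a href> that matches /store/apps/details?id=."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('/store/apps/details?id='):
                self.links.append(href)

# Hypothetical static HTML; the real page injects these anchors via AJAX.
html = ('<a href="/store/apps/details?id=com.example.game">Game</a>'
        '<a href="/about">About</a>')
parser = AppLinkParser()
parser.feed(html)
print(parser.links)  # ['/store/apps/details?id=com.example.game']
```

In a Scrapy shell the equivalent selector would be something like `response.xpath('//a[contains(@href, "/store/apps/details?id=")]/@href')`, which on this page returns an empty list for the reason given in the answer.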

  • Update: I can now crawl the first 120 links, after modifying the start URL and adding "formdata" as someone suggested, but no more links load after that.

Can someone help me?

Answer

It's actually an AJAX POST request that populates the data on that page. You won't see it in the Scrapy shell; instead of inspecting the element, check the Network tab, where you will find the request.

Make a POST request to the URL https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 with formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}

Increment start by 60 on each request to get the paginated results.
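The pagination arithmetic above can be sketched with the standard library; `BASE_URL` is the URL given in the answer, the form fields are the ones quoted there, and `page_formdata` is an illustrative helper name.

```python
from urllib.parse import urlencode

BASE_URL = ('https://play.google.com/store/apps/category/GAME/'
            'collection/topselling_new_free?authuser=0')

def page_formdata(page):
    # Each POST returns up to 60 results; 'start' advances by 60 per page.
    return {'start': str(page * 60), 'num': '60',
            'numChildren': '0', 'ipf': '1', 'xhr': '1'}

# URL-encoded bodies for the first three POST requests.
bodies = [urlencode(page_formdata(p)) for p in range(3)]
print(bodies[0])  # start=0&num=60&numChildren=0&ipf=1&xhr=1
```

In a Scrapy spider you would send each page with `scrapy.FormRequest(BASE_URL, formdata=page_formdata(p), callback=self.parse)`, incrementing `p` in the callback until a response comes back with no more links.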
