How to simulate XHR request using Scrapy when trying to crawl data from an ajax-based website?
Question
I am new to crawling web pages using Scrapy and unfortunately chose a dynamic one to start with...
Thanks to someone helping me out here, I've successfully crawled part of the target website (120 links), but not all the links in it.
After doing some research, I know that crawling an ajax website comes down to these simple steps:
• open the browser developer tools, Network tab
• go to the target website
• click the submit button and see what XHR request goes to the server
• simulate this XHR request in your spider
The last one sounds obscure to me though: how do I simulate an XHR request?
I've seen people using 'headers' or 'formdata' and other parameters to simulate it, but I can't figure out what that means.
Here is part of my code:
```python
import re

import scrapy
from scrapy.http import FormRequest


class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i+60), 'num': '60', 'numChildren': '0', 'ipf': '1',
                          'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.parse,
            )

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
        ...  # body omitted in the original question
```
The start_request method seems to not play any role here. If I delete it, the spider still crawls the same number of links.
I've worked on this problem for a week... I'd highly appreciate it if you could help me out...
Answer
Try this:
```python
import scrapy
from scrapy.http import FormRequest
# assumes googleAppItem is defined in the project's items.py
from ..items import googleAppItem


class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i*60), 'num': '60', 'numChildren': '0', 'ipf': '1',
                          'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.data_parse,
            )

    def data_parse(self, response):
        item = googleAppItem()
        map = {}
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in map:
                map[l] = True
                item['url'] = l
                yield item
```
Run the spider with scrapy crawl googleApp -o links.csv or scrapy crawl googleApp -o links.json and you'll get all the links in a csv or json file. To increase the number of pages to crawl, change the range of the for loop.