How to simulate an XHR request using Scrapy when trying to crawl data from an Ajax-based website?


Question

I am new to crawling web pages with Scrapy and, unfortunately, chose a dynamic one to start with...

Thanks to someone helping me here, I've successfully crawled part of it (120 links), but not all of the links on the target website.

After doing some research, I know that crawling an Ajax site comes down to a few simple ideas:

•open the browser developer tools, Network tab

•go to the target website

•click the submit button and see what XHR request is sent to the server

•simulate this XHR request in your spider

The last one still sounds obscure to me, though: how do I simulate an XHR request?

I've seen people using 'headers' or 'formdata' and other parameters to simulate it, but I can't figure out what that means.
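
As a rough sketch of what that looks like in Scrapy (the spider name, URL, form fields and header below are placeholders made up for illustration, not values taken from the question): formdata reproduces the POST body you see for the XHR in the Network tab, and headers reproduces the request headers the browser sent with it.

import scrapy
from scrapy.http import FormRequest

class xhrExampleSpider(scrapy.Spider):
    # Hypothetical spider; every concrete value here is a placeholder.
    name = "xhrExample"

    def start_requests(self):
        # Replay the XHR observed in the browser's Network tab.
        yield FormRequest(
            url="https://example.com/ajax/endpoint",           # the URL the XHR was sent to
            formdata={"start": "0", "num": "60", "xhr": "1"},   # the POST parameters (form body)
            headers={"X-Requested-With": "XMLHttpRequest"},     # request header many Ajax endpoints check
            callback=self.parse_xhr,
        )

    def parse_xhr(self, response):
        # The response body is whatever the endpoint returns (often an HTML fragment or JSON).
        self.logger.info("received %d bytes", len(response.body))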

Here is part of my code:

import re

import scrapy
from scrapy.http import FormRequest


class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i+60), 'num': '60', 'numChildren': '0',
                          'ipf': '1', 'xhr': '1',
                          'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):

start_request doesn't seem to play any role here: if I delete it, the spider still crawls the same number of links.
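
A hedged aside that may explain this (it is not part of the original post): Scrapy only calls a spider method named start_requests, plural and taking no response argument; a method named start_request(self, response) is never invoked, so the spider silently falls back to plain GET requests built from start_urls. The expected shape is roughly:

import scrapy
from scrapy.http import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']

    def start_requests(self):
        # Note the plural name and no 'response' argument; Scrapy calls this once to seed the crawl.
        yield FormRequest(
            url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
            formdata={'start': '0', 'num': '60', 'xhr': '1'},  # trimmed parameter set, for illustration only
            callback=self.parse,
        )

    def parse(self, response):
        pass  # parsing logic as in the question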

I've been working on this problem for a week... I'd highly appreciate it if you could help me out.

Answer

Try this:

from scrapy import Spider
from scrapy.http import FormRequest


class googleAppSpider(Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        # Replay the XHR the page fires when you scroll: one POST per page of 60 results.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i*60), 'num': '60', 'numChildren': '0',
                          'ipf': '1', 'xhr': '1',
                          'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.data_parse)

    def data_parse(self, response):
        item = googleAppItem()  # item class assumed to be defined in the project's items.py (sketch below)
        map = {}  # de-duplicate links already seen in this response
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in map:
                map[l] = True
                item['url'] = l
                yield item
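
The code above assumes an item class named googleAppItem with a single url field; it is not shown in the answer, but a minimal sketch of what it might look like in the project's items.py is:

import scrapy

class googleAppItem(scrapy.Item):
    # one field per scraped value; here just the app detail-page link
    url = scrapy.Field()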

Run the spider with scrapy crawl googleApp -o links.csv or scrapy crawl googleApp -o links.json and you'll get all the links in a CSV or JSON file. To increase the number of pages crawled, change the range of the for loop.
