Scrapy: how a spider returns a value to another spider


Question

The website that I am crawling lists many players, and when I click on any player, I go to his page.

The website structure is like this:

<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>

When I click on any link, I go to the player's page, which looks like this:

<player name>
<player team>
<player age>
<player salary>
<player date>

I want to scrape all the players whose age is between 20 and 25 years.

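  1. Scrape the main page using the first spider.
  2. Get the links using the first spider.
  3. Crawl each link using the second spider.
  4. Get the player information using the second spider.
  5. Save this information in a JSON file using a pipeline.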
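For the last step, a minimal sketch of a pipeline that writes each item as one JSON line might look like the following. The class name `JsonWriterPipeline` and the file name `players.jl` are assumptions for illustration, and the pipeline would still have to be enabled in the project's `ITEM_PIPELINES` setting. `open_spider`, `close_spider`, and `process_item` are the hooks Scrapy calls on an item pipeline.

```python
import json


class JsonWriterPipeline:
    """Writes every scraped item to a JSON Lines file (one object per line)."""

    def __init__(self, path="players.jl"):
        # `path` is a hypothetical default; adjust it to your project.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline
```

Because the pipeline only sees items, any filtering by age or date has to happen before the item is yielded or in an earlier pipeline.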

My question

How can I return the date value from the second spider to the first spider?

I built my own middleware and overrode process_spider_output. It lets me print the request, but I don't know what else I should do to return that date value to my first spider.
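For reference, a spider middleware's `process_spider_output` receives whatever the callback yielded and must itself return an iterable of requests and items. A minimal pass-through sketch, using the hypothetical class name `PrintOutputMiddleware`, could look like this. It only observes the output and does not by itself send anything back to another callback.

```python
class PrintOutputMiddleware:
    """Pass-through spider middleware that records what a callback yields."""

    def __init__(self):
        self.seen = []  # kept only so the behaviour is easy to inspect

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of requests/items produced by the callback.
        for element in result:
            self.seen.append(element)
            print("callback produced:", element)
            yield element  # anything not passed on here is dropped
```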

Any help is appreciated.

Here is part of the code:

def parse(self, response):
    sel = Selector(response)
    container = sel.css('div[MyDiv]')
    for player in container:
        # extract LINK and TITLE from `player` here
        yield Request(LINK, meta={'Title': TITLE}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE from the response here
    return player

To make it easier for you, I have given generic code rather than very specific details.

Recommended answer

First of all, I want to thank @warwaruk and @Robin for helping me with this issue.

And my greatest thanks go to my great teacher @pault.

I found the solution, and here is the algorithm:

  1. Start scraping on the main page.
  2. Extract all the players' links.
  3. Request each player's link, with a callback that extracts his information. The request's meta includes the number of players on the current main page and the position of the player being scraped.
  4. In the callback for each player:

4.1 Extract the player's information.

4.2 Check whether the date is in the range. If not, do nothing. If it is, check whether this is the last player in the main page's player list; if so, request the next main page with parse as the callback.

Simplified code:

def parse(self, response):
    # Players: the player entries extracted from the main page
    numberOfPlayers = len(Players)
    for currentPlayer, player in enumerate(Players, start=1):
        yield Request(
            player.link,
            meta={'currentPlayer': currentPlayer, 'numberOfPlayers': numberOfPlayers},
            callback=self.parsePlayer,
        )

def parsePlayer(self, response):
    currentPlayer = response.meta['currentPlayer']
    numberOfPlayers = response.meta['numberOfPlayers']
    # extract the player's information into playerInformation here
    # dateInRange: placeholder for your own check of the date range
    if dateInRange(playerInformation['date']):
        if currentPlayer == numberOfPlayers:
            # last player on this main page: continue with the next main page
            yield Request(linkToNextMainPage, callback=self.parse)
        yield playerInformation  # so that it is written to the JSON file

It works very well :)
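The branching inside parsePlayer can be checked without running a crawl. The sketch below is a plain-Python stand-in, not Scrapy code: the function `player_actions`, the tuple markers, and the sample players and date range are all assumptions made for illustration.

```python
from datetime import date


def player_actions(position, total, player, start, end):
    """Yield what the player callback would emit for one player page.

    position/total mirror meta['currentPlayer'] and meta['numberOfPlayers'].
    """
    if not (start <= player["date"] <= end):
        return  # out of range: emit nothing
    if position == total:
        # Last player on this main page: move on to the next main page.
        yield ("request", "next main page")
    yield ("item", player)


# Three hypothetical players on one main page; the second is out of range.
players = [
    {"name": "A", "date": date(2012, 1, 10)},
    {"name": "B", "date": date(2010, 5, 1)},
    {"name": "C", "date": date(2012, 3, 2)},
]
start, end = date(2012, 1, 1), date(2012, 12, 31)

emitted = []
for position, player in enumerate(players, start=1):
    emitted.extend(player_actions(position, len(players), player, start, end))
```

Here `emitted` holds the item for A, then the next-page request and the item for C. Because the next page is requested only when the last player is in range, the crawl ends at the first page whose final player falls outside it.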
