Scrapy: how can one spider return a value to another spider?
Question
The website that I am crawling contains many players, and when I click on any player, I go to his page.
The website structure is like this:
<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>
When I click on any link, I go to the player's page, which looks like this:
<player name>
<player team>
<player age>
<player salary>
<player date>
I want to scrape all the players whose age is between 20 and 25 years.
- Scrape the main page using the first spider.
- Get the links using the first spider.
- Crawl each link using the second spider.
- Get the player information using the second spider.
- Save this information in a JSON file using a pipeline.
My question
How can I return the date value from the second spider to the first spider?
I built my own middleware and overrode process_spider_output. It allows me to print the request, but I don't know what else I should do in order to return that date value to my first spider.
Any help is appreciated.
Here is some of the code:
from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    container = sel.css('div[MyDiv]')
    for player in container:
        # extract LINK and TITLE from the player element
        yield Request(LINK, meta={'Title': TITLE}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE from the response
    return player
To keep things simple, I have given generic code rather than very specific details.
Recommended answer
First of all, I want to thank @warwaruk and @Robin for helping me with this issue.
And my biggest thanks go to my great teacher @pault.
I found the solution, and here is the algorithm:
1. Start scraping on the main page.
2. Extract all the players' links.
3. Issue a request with a callback for each player's link to extract his information. The request's meta includes the number of players on the current main page and the position of the player being scraped.
4. In the callback for each player:
4.1 Extract the player's information.
4.2 Check whether the date is in the range. If not, do nothing. If yes, check whether this is the last player in the main page's player list; if it is, issue a request for the next main page.
Simple code:
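from scrapy.http import Request

def parse(self, response):
    # Players: the player entries extracted from the main page
    numberOfPlayers = len(Players)
    for currentPlayer, player in enumerate(Players, start=1):
        yield Request(
            player.link,
            meta={'currentPlayer': currentPlayer, 'numberOfPlayers': numberOfPlayers},
            callback=self.parsePlayer,
        )

def parsePlayer(self, response):
    currentPlayer = response.meta['currentPlayer']
    numberOfPlayers = response.meta['numberOfPlayers']
    # extract the player's information into playerInformation
    if rangeStart <= playerInformation['date'] <= rangeEnd:  # range bounds are placeholders
        if currentPlayer == numberOfPlayers:
            # last player on this page: continue with the next main page
            yield Request(linkToNextMainPage, callback=self.parse)
        yield playerInformation  # written to the JSON file by the pipeline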
It works well :)
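The decision in step 4.2 can also be pulled out of the Scrapy callback into a plain function, which makes it easy to check without running a crawl. The sketch below is only an illustration under assumptions: the function name, the `(kind, value)` output format, and the sample values are all hypothetical, and the player is assumed to be a dict with a `date` key. Note that Scrapy handles requests asynchronously, so the player in the last position is not necessarily the last one to be processed.

```python
from datetime import date


def plan_player_outputs(player, position, total, start, end, next_page_url):
    """Return what the player callback should yield, as (kind, value) pairs."""
    # Out of range: yield nothing, as in step 4.2.
    if not (start <= player["date"] <= end):
        return []
    outputs = []
    # Only the last player on the page triggers the next main page.
    if position == total and next_page_url is not None:
        outputs.append(("request", next_page_url))
    outputs.append(("item", player))
    return outputs


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = {"name": "Example Player", "date": date(2012, 5, 1)}
    print(plan_player_outputs(sample, 3, 3, date(2012, 1, 1),
                              date(2012, 12, 31), "http://example.com/page/2"))
```

In a spider, the callback would turn each `("request", url)` pair into a `Request` and yield each `("item", player)` value directly.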