python scrapy how to use BaseDupeFilter


Question

I have a website with many pages like this:

mywebsite/?page=1

mywebsite/?page=2

...

...

...

mywebsite/?page=n

Each page has links to players. When you click on any link, you go to that player's page.

Users can add players, so I will end up with this situation:

Player1page=1 中有一个链接.

Player1 has a link in page=1.

Player10page=2

After an hour, because users have added new players, I will have this situation:

Player1page=3

Player10page=4

And new players like Player100 and Player101 have links in page=1.

I want to scrape all players to get their information. However, I don't want to re-scrape players that I have already scraped. My question is: how do I use BaseDupeFilter in scrapy to identify which players have been scraped and which have not? Remember, I want to scrape the pages of the website because each page will have different players each time.

Thanks.

Answer

I'd take another approach: instead of querying for the last player during the spider run, launch the spider with a pre-calculated argument naming the last scraped player:

scrapy crawl <my spider> -a last_player=X
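One way to obtain that argument is to persist the highest scraped player id between runs. This is a minimal sketch of such a helper; the file name and function names are illustrative, not part of Scrapy's API:

```python
# Hypothetical helper for remembering the highest scraped player id
# between spider runs. The state file location is an assumption.
from pathlib import Path

STATE_FILE = Path("last_player.txt")

def load_last_player(default=0):
    """Return the highest player id recorded by the previous run."""
    if STATE_FILE.exists():
        return int(STATE_FILE.read_text().strip())
    return default

def save_last_player(player_id):
    """Record the highest scraped player id for the next run."""
    STATE_FILE.write_text(str(player_id))
```

You would call `save_last_player` when the spider finishes and pass `load_last_player()` as the `-a last_player=` value on the next launch.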

Then your spider might look like this:

import scrapy

class MySpider(scrapy.Spider):
    name = "players"
    start_urls = ["http://....mywebsite/?page=1"]

    def __init__(self, last_player=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a arguments arrive as strings, so convert once here
        self.last_player = int(last_player)

    def parse(self, response):
        last_player_met = False
        player_links = response.xpath(...)
        for player_link in player_links:
            player_id = int(player_link.split(...))
            if player_id > self.last_player:
                # a player added since the previous run -- scrape it
                yield scrapy.Request(url=player_link, callback=self.scrape_player)
            else:
                # reached a player we already scraped; stop collecting
                last_player_met = True
        if not last_player_met:
            # try to xpath for 'Next' in pagination
            # or use meta={} in the request to loop over pages like
            # "http://....mywebsite/?page=" + page_number
            yield scrapy.Request(url=..., callback=self.parse)
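The per-page decision can be isolated as a pure function. This is a sketch under the assumption that player ids increase over time (so new players have higher ids than `last_player`) and that pages list newest players first; the function name is illustrative:

```python
def split_new_players(player_ids, last_player):
    """Given player ids as they appear on a page (newest first),
    return (ids to scrape, whether an already-scraped player was seen)."""
    to_scrape = [pid for pid in player_ids if pid > last_player]
    last_player_met = any(pid <= last_player for pid in player_ids)
    return to_scrape, last_player_met
```

When `last_player_met` is True the spider can stop paginating, since every player on later pages was already scraped in a previous run.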
