Scrapy spider_idle signal - need to add requests with parse item callback


Problem Description

In my Scrapy spider I have overridden the start_requests() method in order to retrieve some additional URLs from a database, representing items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):

import MySQLdb
import MySQLdb.cursors
from scrapy import Request


def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

    # Attempt to crawl orphaned items: products with no scrape record for today.
    db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                         port=self.settings['AWS_RDS_PORT'],
                         user=self.settings['AWS_RDS_USER'],
                         passwd=self.settings['AWS_RDS_PASSWD'],
                         db=self.settings['AWS_RDS_DB'],
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True,
                         charset="utf8")
    c = db.cursor()

    c.execute("""SELECT p.url FROM products p
                 LEFT JOIN product_data pd
                   ON p.id = pd.product_id AND pd.scrape_date = CURDATE()
                 WHERE p.website_id = %s AND pd.id IS NULL""",
              (self.website_id,))

    while True:
        row = c.fetchone()
        if row is None:
            break
        # Record each orphaned product in the crawl stats.
        self.crawler.stats.inc_value('orphaned_count')
        yield Request(row['url'], callback=self.parse_item)

    db.close()

Unfortunately, it appears that the crawler queues up these orphaned items during the rest of the crawl - so, in effect, too many items are regarded as orphaned (because the database query is executed before the crawler has retrieved those items in the normal crawl).

I need this orphan-handling process to happen at the end of the crawl, so I believe I need to use the spider_idle signal.

However, my understanding is that I can't simply yield requests in my spider_idle handler - instead I should use self.crawler.engine.crawl?
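
For reference, this is roughly how such a handler gets connected (a minimal sketch; MySpider and idle_method are placeholder names):

from scrapy import Spider, signals


class MySpider(Spider):
    name = 'my_spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # Call idle_method whenever the spider runs out of pending requests.
        crawler.signals.connect(spider.idle_method, signal=signals.spider_idle)
        return spider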

I need the requests to be processed by my spider's parse_item() method (and my configured middleware, extensions and pipelines to be applied). How can I achieve this?

Recommended Answer

The idle method connected to the spider_idle signal (let's say it is called idle_method) receives the spider as an argument, so you can do something like:

def idle_method(self, spider):
    # myurl is a placeholder for the URL you want to schedule.
    self.crawler.engine.crawl(
        Request(url=myurl, callback=spider.parse_item), spider)
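
Putting it together, a minimal sketch of the idle handler might look like this (get_orphaned_urls is a hypothetical helper wrapping the database query from the question):

from scrapy import Request
from scrapy.exceptions import DontCloseSpider


def idle_method(self, spider):
    # Hypothetical helper that runs the orphaned-products query shown
    # above and returns a list of URLs.
    urls = self.get_orphaned_urls()
    for url in urls:
        self.crawler.engine.crawl(
            Request(url, callback=spider.parse_item), spider)
    if urls:
        # Keep the spider alive until the newly scheduled requests are done.
        raise DontCloseSpider

Because these requests are fed back through the engine, they pass through the scheduler, downloader middlewares and item pipelines just like requests yielded from a spider callback. (Note that newer Scrapy releases have deprecated the spider argument of engine.crawl, so on recent versions you would pass only the Request.)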
