Is it possible to remove requests from scrapy's scheduler queue?


Question

Is it possible to remove requests from scrapy's scheduler queue? I have a working routine that limits crawling of a given domain to a set amount of time. It works in the sense that it will not yield any more links once the time limit is hit, but since the queue can already contain thousands of requests for the domain, I'd like to remove them from the scheduler queue once the time limit is reached.
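For context, here is a minimal sketch of the kind of per-domain time limit described above, assuming the budget is tracked in a dict on the spider; the names TimedSpider, crawl_deadlines, and CRAWL_SECONDS are illustrative, not from the original question:

import time

import scrapy
import tldextract

class TimedSpider(scrapy.Spider):
    name = "timed"
    start_urls = ["http://example.com"]
    crawl_deadlines = {}   # registered domain -> unix timestamp deadline
    CRAWL_SECONDS = 300    # illustrative per-domain time budget

    def parse(self, response):
        domain = tldextract.extract(response.url).registered_domain
        # First visit to a domain starts its clock.
        deadline = self.crawl_deadlines.setdefault(
            domain, time.time() + self.CRAWL_SECONDS)
        if time.time() > deadline:
            return  # past the budget: yield no more links for this domain
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

The catch, as noted above, is that requests yielded before the deadline can still be sitting in the scheduler queue long after it passes.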

Answer

Okay, so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader Middleware:

from scrapy.exceptions import IgnoreRequest
import tldextract

class clearQueueDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Reduce the request URL to its registered domain, e.g. "example.com".
        domain_obj = tldextract.extract(request.url)
        just_domain = domain_obj.registered_domain
        if just_domain in spider.blocked:
            print("Blocked domain: %s (url: %s)" % (just_domain, request.url))
            # Raising IgnoreRequest makes Scrapy drop the request instead
            # of downloading it.
            raise IgnoreRequest("URL blocked: %s" % request.url)

spider.blocked is a class-level list on the spider that holds the blocked domains, so no further downloads happen for those domains. Seems to work great, kudos to @rickgh12hs!
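To wire this up, the middleware has to be registered in the project settings and the spider has to put expired domains on its blocked list. A sketch, extending the TimedSpider example from the question; the module path myproject.middlewares and the priority 543 are assumptions, not part of the original answer:

# settings.py -- register the custom downloader middleware
# (the module path and the priority value are illustrative)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.clearQueueDownloaderMiddleware": 543,
}

In TimedSpider, declare blocked = [] as a class attribute alongside crawl_deadlines, and replace the bare return in parse() with:

        if time.time() > deadline:
            # Time is up: block the domain so the middleware drops the
            # thousands of requests that may still be queued for it.
            if domain not in self.blocked:
                self.blocked.append(domain)
            return

Note that raising IgnoreRequest does not purge the scheduler's queue in one sweep; queued requests for a blocked domain are discarded one by one as they come up for download, which has the same practical effect.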
