Dynamically adding domains to scrapy crawlspider deny_domains list
Question
I am currently using scrapy's CrawlSpider to look for specific info across a list of multiple start_urls. What I would like to do is stop crawling a specific start_url's domain once I've found the information I'm looking for, so the spider won't keep hitting that domain and will instead only hit the other start_urls.
Is there a way to do this? I have tried appending to deny_domains like so:
    deniedDomains = []
    ...
    rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...))]
    ...
    def parseURL(self, response):
        ...
        self.deniedDomains.append(specificDomain)
Appending doesn't seem to stop the crawling, but if I start the spider with the intended specificDomain already in the list, then it stops as requested. So I'm assuming that you can't change the deny_domains list after the spider has started?
Answer
The best way to do this is to maintain your own dynamic_deny_domain list in your Spider class:
- write a simple Downloader Middleware,
- it's a simple class with one method to implement: process_request(request, spider):
- raise IgnoreRequest if the request's domain is in your spider.dynamic_deny_domain list, return None otherwise.
Then add your downloader middleware to the DOWNLOADER_MIDDLEWARES dict in your scrapy settings, at an early position:

    'myproject.downloadermiddleware.IgnoreDomainMiddleware': 50,
That should do the trick.
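A minimal sketch of the middleware described above. The class name IgnoreDomainMiddleware and the attribute name dynamic_deny_domain match the answer, but the exact implementation is an assumption; in a real project you would import IgnoreRequest from scrapy.exceptions instead of the stand-in defined here, which only exists so the sketch runs without Scrapy installed.

```python
from urllib.parse import urlparse


class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest so this sketch is self-contained."""


class IgnoreDomainMiddleware:
    """Downloader middleware that drops requests whose domain the spider
    has added to its dynamic_deny_domain list at runtime."""

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if domain in getattr(spider, "dynamic_deny_domain", ()):
            # Raising IgnoreRequest tells Scrapy to silently drop this request.
            raise IgnoreRequest(f"domain denied dynamically: {domain}")
        return None  # None lets the request continue through the download chain


# Minimal stubs to demonstrate the behavior without a full Scrapy install
class FakeRequest:
    def __init__(self, url):
        self.url = url


class FakeSpider:
    dynamic_deny_domain = ["example.com"]
```

In the spider's parse callback, appending a domain to self.dynamic_deny_domain (e.g. once the target info is found) is enough: the middleware checks the list on every outgoing request, so the change takes effect immediately, unlike deny_domains, which the link extractor reads only when the rules are compiled at startup.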