Dynamically adding domains to scrapy crawlspider deny_domains list
Question
I am currently using scrapy's CrawlSpider to look for specific info across a list of multiple start_urls. What I would like to do is stop crawling a specific start_url's domain once I've found the information I'm looking for, so the spider won't keep hitting that domain and will instead only hit the other start_urls.
Is there a way to do this? I have tried appending to deny_domains like so:
    deniedDomains = []
    ...
    rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...))]
    ...
    def parseURL(self, response):
        ...
        self.deniedDomains.append(specificDomain)
Appending doesn't seem to stop the crawling, but if I start the spider with the intended specificDomain already in the list, then it stops as requested. So I'm assuming that you can't change the deny_domains list after the spider has started?
Answer
The best way to do this is to maintain your own dynamic_deny_domain list in your Spider class:
- write a simple Downloader Middleware,
- it's a simple class with one method to implement: process_request(request, spider):
- raise IgnoreRequest if the request's domain is in your spider.dynamic_deny_domain list, return None otherwise.
Then add your downloader middleware to the DOWNLOADER_MIDDLEWARES dict in your scrapy settings, at an early position:

    'myproject.downloadermiddleware.IgnoreDomainMiddleware': 50,
That should do the trick.
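A minimal sketch of the middleware described above. The class name IgnoreDomainMiddleware and the attribute name dynamic_deny_domain match the answer, but the exact implementation is an assumption; in a real project you would import IgnoreRequest from scrapy.exceptions instead of the stand-in defined here, which only exists so the sketch runs without Scrapy installed.

```python
from urllib.parse import urlparse


class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest so this sketch is self-contained."""


class IgnoreDomainMiddleware:
    """Downloader middleware that drops requests whose domain the spider
    has added to its dynamic_deny_domain list at runtime."""

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        if domain in getattr(spider, "dynamic_deny_domain", ()):
            # Raising IgnoreRequest tells Scrapy to silently drop this request.
            raise IgnoreRequest(f"domain denied dynamically: {domain}")
        return None  # None lets the request continue through the download chain


# Minimal stubs to demonstrate the behavior without a full Scrapy install
class FakeRequest:
    def __init__(self, url):
        self.url = url


class FakeSpider:
    dynamic_deny_domain = ["example.com"]
```

In the spider's parse callback, appending a domain to self.dynamic_deny_domain (e.g. once the target info is found) is enough: the middleware checks the list on every outgoing request, so the change takes effect immediately, unlike deny_domains, which the link extractor reads only when the rules are compiled at startup.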