Python Scrapy: allowed_domains adding new domains from database
Problem description
I need to add more domains to allowed_domains, so I don't get the "Filtered offsite request to" messages.
My app gets the URLs to fetch from a database, so I can't add them manually.
I tried to override the spider's __init__ like this:
    def __init__(self):
        super(CrawlSpider, self).__init__()
        self.start_urls = []
        for destination in Phpbb.objects.filter(disable=False):
            self.start_urls.append(destination.forum_link)
            self.allowed_domains.append(destination.link)
start_urls was fine; that was the first issue I solved. But allowed_domains has no effect.
Do I need to change some configuration in order to disable domain checking? I don't really want that, since I only want the domains from the database, but for now disabling the domain check would help me.
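As a temporary workaround, the offsite check can be switched off by disabling the middleware that performs it. This is a sketch of a settings.py entry; the middleware path shown matches the old scrapy/contrib layout mentioned in this question (newer Scrapy releases moved it to scrapy.spidermiddlewares.offsite.OffsiteMiddleware):

    # settings.py -- assumes an older Scrapy with the scrapy/contrib layout;
    # setting the middleware's value to None removes it from the pipeline,
    # so no "Filtered offsite request" filtering happens at all.
    SPIDER_MIDDLEWARES = {
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    }

With the middleware disabled, every request is followed regardless of allowed_domains, so this should only be a stopgap while the dynamic domain list is sorted out.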
Thanks!!
Recommended answer
The 'allowed_domains' parameter is optional. To get started, you can simply omit it to disable domain filtering.

In scrapy/contrib/spidermiddleware/offsite.py you can override this function to implement your own custom domain filtering:

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        domains = [d.replace('.', r'\.') for d in allowed_domains]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)