Crawl multiple domains with Scrapy without criss-cross
Question
I have set up a CrawlSpider aggregating all outbound links (crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# LinkNetworkItem is the user's own Item class (definition not shown)

class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["exampleA.com"]
    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()
        outgoing_links = []
        for link in links:
            if "http://" in link:
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # count allowed domains that do NOT contain this base domain
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)
        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        return None
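The domain-normalisation step inside parse_item can be tried in isolation. A minimal sketch, using Python 3's urllib.parse rather than the Python 2 urlparse above; base_domain and is_outgoing are illustrative names, not part of the original spider, and is_outgoing uses plain list membership instead of the substring comparison in the spider:

```python
from urllib.parse import urlparse

def base_domain(link):
    """Reduce a URL to its registered domain, e.g. 'http://sub.example.com:8080/x' -> 'example.com'.

    Note: this naive last-two-labels heuristic misclassifies multi-part TLDs
    such as 'example.co.uk' (it would return 'co.uk').
    """
    host = urlparse(link).hostname or ''
    host = host.split(':')[0]              # drop any port (hostname usually strips it already)
    return '.'.join(host.split('.')[-2:])  # keep only the last two labels

def is_outgoing(link, allowed_domains):
    """True if the link points outside every allowed domain."""
    return base_domain(link) not in allowed_domains
```

For example, base_domain("http://www.example.com/page") yields "example.com", so such a link would not be recorded as outgoing for allowed_domains = ["example.com"].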
I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com ...). At first I thought I could just add my list to start_urls as well as allowed_domains, but in my opinion this causes the following problems:
- Will the DEPTH_LIMIT setting be applied per start_url/allowed_domain?
- More importantly: if the sites are connected, will the spider jump from exampleA.com to exampleB.com because both are in allowed_domains? I need to avoid this criss-cross because I later want to count the outbound links per site to get information about the relationships between the websites!
So how can I scale this to more spiders without running into the criss-crossing problem, while still applying the settings per website?
Additional image showing what I would like to realize:
Answer
I have now achieved it without rules. I attached a meta attribute to every start_url and then simply check myself whether the links belong to the original domain, sending out new requests accordingly.
So, override start_requests:

def start_requests(self):
    # start_domains is a custom list on the spider, paired one-to-one with start_urls
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]
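The url-to-domain pairing itself can be checked outside Scrapy. A small sketch, where start_urls and start_domains are the parallel lists the answer assumes you define on the spider, and the dict form merely stands in for the Request objects:

```python
start_urls = ["http://www.example.com", "http://www.other.org"]
start_domains = ["example.com", "other.org"]

# Each seed request remembers which domain it belongs to via meta;
# follow-up requests for that site inherit the same tag by passing it along.
requests = [
    {"url": url, "meta": {"domain": domain}}
    for url, domain in zip(start_urls, start_domains)
]
```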
In the subsequent parsing methods we grab the meta attribute with domain = response.request.meta['domain'], compare the domain with the extracted links, and send out new requests ourselves.
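That per-response check can be sketched outside Scrapy. split_links below is a hypothetical helper that partitions extracted hrefs by the seed domain carried in meta; it uses an endswith test, a simplification of the original substring comparison:

```python
from urllib.parse import urlparse

def split_links(links, own_domain):
    """Partition extracted hrefs into same-domain links (to follow)
    and outgoing links (to record). 'own_domain' is the value
    carried in request.meta['domain'].

    Caveat: relative hrefs have no hostname and land in 'outgoing'
    here; a real spider would urljoin() them against response.url first.
    """
    internal, outgoing = [], []
    for link in links:
        host = urlparse(link).hostname or ''
        if host.endswith(own_domain):
            internal.append(link)
        else:
            outgoing.append(link)
    return internal, outgoing
```

In the actual spider, each internal link would become a new Request(link, meta={'domain': own_domain}, callback=self.parse_item) so the tag propagates, while the outgoing links go into the item, which keeps each crawl confined to its seed domain.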