Crawl multiple domains with Scrapy without criss-cross

Question

I have set up a CrawlSpider that aggregates all outbound links, crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2.

from urlparse import urlparse  # Python 2 stdlib; urllib.parse on Python 3

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class LinkNetworkSpider(CrawlSpider):

    name = "network"
    allowed_domains = ["exampleA.com"]

    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):

        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()

        outgoing_links = []

        for link in links:
            if "http://" in link:
                # Reduce the link to its registered domain.
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # drop subdomains
                # Count allowed domains that do not contain this base domain;
                # with a single entry in allowed_domains, a non-zero count
                # means the link points to another site (outbound).
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)

        if outgoing_links:
            item = LinkNetworkItem()  # defined in the project's items.py
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None
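
The DEPTH_LIMIT mentioned above is a standard Scrapy setting; it is typically set project-wide in settings.py, for example:

# settings.py
DEPTH_LIMIT = 2  # follow links at most two hops away from the start URLs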

I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add my list of sites to start_urls as well as allowed_domains (sketched below, after the list), but in my opinion this causes the following problems:

  • Will the DEPTH_LIMIT setting be applied separately for each start_url/allowed_domain?
  • More importantly: if the sites are linked to each other, will the spider jump from exampleA.com to exampleB.com, since both are in allowed_domains? I need to avoid this criss-cross, because I later want to count the outbound links of every site to gain information about the relationships between the websites!
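
For concreteness, the naive extension in question would look something like this (a sketch only, using the domains from the question):

class LinkNetworkSpider(CrawlSpider):

    name = "network"
    # Naive extension: simply list every site in both attributes.
    allowed_domains = ["exampleA.com", "exampleB.com", "exampleC.com"]
    start_urls = ["http://www.exampleA.com",
                  "http://www.exampleB.com",
                  "http://www.exampleC.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)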

So how can I scale this to more spiders without running into the criss-crossing problem, while still applying the settings per website?

Additional image showing what I would like to realize:

Answer

I have now achieved it without rules. I attached a meta attribute to every start URL and then simply check myself whether the links belong to the original domain, sending out new requests accordingly.

So, override start_requests:

from scrapy.http import Request

def start_requests(self):
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]
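
Note that start_domains is not a built-in Scrapy attribute; it is assumed here to be a list of base domains kept parallel to start_urls, for example:

start_urls = ["http://www.exampleA.com", "http://www.exampleB.com"]
start_domains = ["exampleA.com", "exampleB.com"]  # parallel to start_urls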

In the subsequent parsing methods we grab the meta attribute via domain = response.request.meta['domain'], compare the domain against the extracted links, and send out new requests ourselves.
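
To make this concrete, here is a minimal sketch of what such a parsing method could look like inside the spider, reusing the link extraction and LinkNetworkItem from the question (the subdomain stripping mirrors the original parse_item; the exact matching logic is an assumption):

from urlparse import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    # The domain this crawl branch started from (set in start_requests).
    domain = response.request.meta['domain']

    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a/@href').extract()

    outgoing_links = []
    for link in links:
        if link.startswith('http'):
            # Same subdomain stripping as in the question's code.
            base_url = '.'.join(urlparse(link).hostname.split('.')[-2:])
            if base_url == domain:
                # Internal link: keep crawling, propagating the origin
                # domain so deeper pages still know where they started.
                yield Request(link, meta={'domain': domain},
                              callback=self.parse_item)
            else:
                # External link: record it, but do not follow it --
                # this is what prevents the criss-cross between sites.
                outgoing_links.append(link)

    if outgoing_links:
        item = LinkNetworkItem()
        item['internal_site'] = response.url
        item['out_links'] = outgoing_links
        yield item

Since Scrapy's DepthMiddleware also applies to requests yielded from callbacks, DEPTH_LIMIT should still limit how deep each site is crawled.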
