Dynamic rules based on start_urls for Scrapy CrawlSpider?


Problem description

I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).

I managed to do that with 2 rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which "start_url" I'm currently on, so I can't change the rule appropriately.

Here's what I came up with so far; it works for one website, but I'm not sure how to apply it to a list of websites:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    homepage = 'http://www.somesite.com'

    start_urls = [homepage]

    # strip the scheme, the 'www.' prefix and any trailing slash
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        # internal links: same domain, follow and log
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
        # external links: any other domain, scrape but don't follow
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass

This can probably be done by just passing the start_url as an argument when calling the scraper, but I'm looking for a way to do that programmatically within the scraper itself.
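
For reference, that argument-passing approach would presumably look something like the sketch below (the homepage argument and the -a flag usage are assumptions for illustration, not part of the code above):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HomepagesArgSpider(CrawlSpider):
    # Hypothetical sketch: run with
    #   scrapy crawl homepages_arg -a homepage=http://www.somesite.com
    name = 'homepages_arg'

    def __init__(self, homepage=None, *args, **kwargs):
        domain = homepage.replace('http://', '').replace('https://', '') \
                         .replace('www.', '').rstrip('/')
        self.start_urls = [homepage]
        # rules must be set before CrawlSpider.__init__ compiles them
        self.rules = (
            Rule(LinkExtractor(allow_domains=(domain,)), callback='parse_internal', follow=True),
            Rule(LinkExtractor(deny_domains=(domain,)), callback='parse_external', follow=False),
        )
        super(HomepagesArgSpider, self).__init__(*args, **kwargs)

    def parse_internal(self, response):
        pass  # log internal page...

    def parse_external(self, response):
        pass  # parse external page...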

Any ideas? Thanks!

Simon.

Recommended answer

I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in Scrapy.

I've created a function that takes a url as input and creates the rules for it:

def rules_for_url(self, url):
    # build the internal/external rule pair for a single start URL
    domain = Tools.get_domain(url)

    rules = (
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
    )

    return rules
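
The Tools.get_domain helper used above isn't shown here; a minimal sketch of what such a helper might look like, built on the standard library's urlparse (an assumption, not the actual implementation):

# Hypothetical sketch of the Tools.get_domain helper used above
# (the real implementation isn't part of the answer).
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2


class Tools(object):
    @staticmethod
    def get_domain(url):
        # 'http://www.somesite.com/page' -> 'somesite.com'
        netloc = urlparse(url).netloc
        return netloc[4:] if netloc.startswith('www.') else netloc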

I then override some of CrawlSpider's functions.

  1. I changed _rules into a dictionary where the keys are the different website domains and the values are the rules for that domain (using rules_for_url). The population of _rules is done in _compile_rules.

  2. I then made the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules:

# requires these imports at the top of the spider module:
#   import copy
#   import six
#   from scrapy.http import HtmlResponse

# maps each start URL's domain to the list of compiled rules for that domain
_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()

    # pick the rule set belonging to the domain this response came from
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            # store both the domain and the rule index in the request meta
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    # recover the domain and rule index stored by _requests_to_follow
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])

    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    # called from CrawlSpider.__init__, so _rules is populated per start URL
    # as soon as the spider is instantiated
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

View the original functions here.

Now the spider will simply go over each url in start_urls and create a set of rules specific to that url, then use the appropriate rules for each website being crawled.
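
As a usage sketch (the second site URL below is made up for illustration), the spider itself then only needs to list its start URLs:

from scrapy.spiders import CrawlSpider


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    # one internal/external rule pair is built per start URL by rules_for_url
    start_urls = [
        'http://www.somesite.com',
        'http://www.anothersite.org',
    ]

    # rules_for_url, _rules, _requests_to_follow, _response_downloaded
    # and _compile_rules go here, exactly as defined above

    def parse_internal(self, response):
        pass  # log internal page...

    def parse_external(self, response):
        pass  # parse external page...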

Hope this helps anyone who stumbles upon this problem in the future.

Simon.
