Scrapy CrawlSpider rules with multiple callbacks

Problem description

I'm trying to create an ExampleSpider which implements the Scrapy CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages which contain both album and artist info.

I was able to handle the first two scenarios, but the problem occurs in the third. I'm using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?

  1. Should I do it like below? (two rules for the same URL pattern)
  2. Should I use multiple callbacks? (does Scrapy support multiple callbacks?)
  3. Is there another way to do it? (a proper way)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# ArtistItem and AlbumItem are defined in the project's items module
class ExampleSpider(CrawlSpider):
    name = 'example'

    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item

Recommended answer

The CrawlSpider calls the _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # links already claimed by an earlier rule are filtered out here
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            # each request is bound to exactly one rule, and thus one callback
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

As you can see:

  • The variable seen memorizes the URLs that have already been processed.
  • Every URL will be parsed by at most one callback, so defining two rules with the same URL pattern (option 1) will not work: the second rule never receives the links.
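
Here is a standalone sketch of that seen logic, with hypothetical link lists standing in for two rules that match the same pattern:

seen = set()
rule_links = [
    ['http://www.example.com/a', 'http://www.example.com/b'],  # rule 0 -> parse_artist
    ['http://www.example.com/a', 'http://www.example.com/b'],  # rule 1 -> parse_album
]
for n, links in enumerate(rule_links):
    fresh = [l for l in links if l not in seen]  # same filter as in _requests_to_follow()
    seen = seen.union(fresh)
    print(n, fresh)
# 0 ['http://www.example.com/a', 'http://www.example.com/b']
# 1 []   <- the second rule gets nothing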

You can define a parse_item() to call parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    # one callback per URL: dispatch to both item parsers from here
    yield self.parse_artist(response)
    yield self.parse_album(response)
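
This works because parse_artist() and parse_album() each return a single item. If you later rewrite them as generators that yield several items per page, delegate with a loop instead (a sketch under that assumption):

def parse_item(self, response):
    # forward every item produced by the two generator-style parsers
    for item in self.parse_artist(response):
        yield item
    for item in self.parse_album(response):
        yield item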
