In which order do the rules get evaluated in the CrawlSpider?


Question

I have a question regarding the order in which the rules get evaluated in a CrawlSpider. If I have the code below:

from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='first_callback'
        ),
        Rule(
            LinkExtractor(allow='/product.php?id=\d+'),
            callback='second_callback'
        )
    )

In this case:

  • The engine will send a request for 'http://someurlhere.com' from the start_urls list and call the default parse callback when it gets the response.
  • Then, in that parse method, it will extract links from the response based on the XPaths we provided to the FIRST LinkExtractor.

Now my question is: are the links extracted by the FIRST LinkExtractor rule simply handed to the scheduler rather than followed immediately? So after it schedules all the links extracted by the first LinkExtractor, will it then call the first_callback method for each of those links, with the response passed to that first_callback?

Also, when is the second LinkExtractor going to be called? Is the first LinkExtractor evaluated first, and only then does the second LinkExtractor run?

Answer

If we go through the official documentation, the process is simple.

First, your start URL is parsed, and then links from every subsequently crawled page are extracted according to the rules provided.
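
To make that flow concrete, here is a heavily simplified sketch of what CrawlSpider does with each response. This is not the real implementation (the class, method, and attribute names are approximations), but it reflects the documented behaviour: links matched by a rule become new requests that are handed to the scheduler, not downloaded on the spot.

from scrapy import Request


class _SimplifiedCrawlFlow:
    """Illustration only: a rough approximation of CrawlSpider's per-response handling."""

    rules = ()  # a concrete spider would define its Rule objects here

    def _handle_response(self, response, callback=None, follow=True):
        # 1. Run the callback attached to the request that produced this response, if any.
        if callback:
            for item_or_request in callback(response) or ():
                yield item_or_request
        # 2. If following is enabled for this response, apply every rule in the order
        #    the rules were declared.
        if follow:
            for rule in self.rules:
                for link in rule.link_extractor.extract_links(response):
                    # A matching link only produces a Request here: it is scheduled,
                    # not fetched immediately. The rule's callback (assumed already
                    # resolved to a method) runs later, when that response arrives.
                    yield Request(link.url, callback=rule.callback)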

Now, coming to your question:

Now my question is: are the links extracted by the FIRST LinkExtractor rule simply handed to the scheduler rather than followed immediately? So after it schedules all the links extracted by the first LinkExtractor, will it then call the first_callback method for each of those links, with the response passed to that first_callback?

If callback is None, follow defaults to True; otherwise it defaults to False. That means in your case there will be no following: whatever links are extracted from the start URL's response are all the scheduler will ever get, and the crawl will end once those responses have been parsed.
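
Put differently: a rule with a callback does not follow unless you ask it to. The snippet below is just a sketch of the two combinations side by side, using the current import paths rather than the scrapy.contrib ones from the question; follow is an explicit Rule argument.

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# With a callback, following has to be requested explicitly.
parse_and_follow = Rule(
    LinkExtractor(restrict_xpaths=["//ul[@class='menu-categories']"]),
    callback='first_callback',   # parse these pages...
    follow=True,                 # ...and keep extracting links from them as well
)

# Without a callback, follow is implicitly True: the page is used for navigation only.
follow_only = Rule(
    LinkExtractor(restrict_xpaths=["//ul[@class='menu-categories']"]),
)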

If you want the spider to follow links, break the rules up accordingly: work out where your content is and where the navigation resources are.

# Extract links matching 'products' (but not matching 'shampoo')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('products', ), deny=('shampoo', ))),

# Extract links matching 'item' and parse them with the spider's method parse_item
Rule(LinkExtractor(allow=('item', )), callback='parse_item'),
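
Applied to the spider from the question, that split might look roughly like the sketch below. The assumption (not stated in the question) is that the category/sub-category menus are pure navigation and the product pages are the content; the name, the parse_product callback, and the escaped regex are additions for illustration.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'  # hypothetical; Scrapy requires every spider to have a name
    start_urls = ['http://someurlhere.com']
    rules = (
        # Navigation: follow the category/sub-category menus.
        # No callback, so follow defaults to True.
        Rule(LinkExtractor(restrict_xpaths=[
            "//ul[@class='menu-categories']",
            "//ul[@class='menu-subcategories']"])),
        # Content: parse product pages. With a callback, they are not followed further.
        Rule(LinkExtractor(allow=r'/product\.php\?id=\d+'),
             callback='parse_product'),
    )

    def parse_product(self, response):
        # Placeholder callback: extract whatever fields the product page exposes.
        yield {'url': response.url}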

Now, coming to your second question:

Also, when is the second LinkExtractor going to be called? Is the first LinkExtractor evaluated first, and only then does the second LinkExtractor run?

One is not dependent on the other. Each LinkExtractor applies its regex or string matching independently. Whenever an extractor finds a matching URL, its rule proceeds with its callback or follows the link.
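
One quick way to see this is to run both extractors by hand against the same response, for example inside scrapy shell. This is just a sketch, reusing the question's XPaths and URL pattern (with the regex escaped).

# Inside `scrapy shell http://someurlhere.com`, where `response` is already defined:
from scrapy.linkextractors import LinkExtractor

menu_links = LinkExtractor(restrict_xpaths=[
    "//ul[@class='menu-categories']",
    "//ul[@class='menu-subcategories']"]).extract_links(response)

product_links = LinkExtractor(allow=r'/product\.php\?id=\d+').extract_links(response)

# Each extractor scans the same response on its own;
# neither result depends on the other.
print(len(menu_links), len(product_links))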
