How does scrapy use rules?


Question

I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.

If I have a rule where I'm crawling through the yellow pages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule - specifically, how does it activate the restrict_xpaths attribute?

Thanks.

Recommended answer

The rules of a CrawlSpider specify how links should be extracted from a page and which callbacks should be called for those links. They are handled by the default parse() method implemented in that class; reading the CrawlSpider source shows how.

So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
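
As a minimal sketch of that idea (the spider name and URL below are made-up placeholders, not anything from the question):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RulesDemoSpider(CrawlSpider):
    name = 'rules_demo'  # hypothetical spider name
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def start_requests(self):
        # routing the response through self.parse lets CrawlSpider apply the rules to it
        yield scrapy.Request('http://example.com', callback=self.parse)

    def parse_item(self, response):
        self.logger.info('rule matched: %s', response.url)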

The extraction of the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a> and <area> elements in the whole page, or only within the elements obtained after applying the restrict_xpaths expressions if that attribute is set.
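
As a small illustration of what the link extractor does on its own, you can call it directly on a response object, for example the response that scrapy shell <url> provides (the XPath here is just a placeholder):

from scrapy.linkextractors import LinkExtractor

# restrict extraction to <a>/<area> elements found under the matching <ul>
extractor = LinkExtractor(restrict_xpaths=["//ul[@class='menu-categories']"])

# `response` is assumed to be any scrapy Response, e.g. from `scrapy shell <url>`
for link in extractor.extract_links(response):
    print(link.url, link.text)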

For example, say you have a CrawlSpider like so:

from scrapy.spiders import CrawlSpider, Rule  # scrapy.contrib.* import paths are deprecated in current Scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'  # a spider needs a name to be runnable; this one is arbitrary
    start_urls = ['http://someurlhere.com']
    rules = (
        # category/subcategory links: hand the responses back to the default
        # parse() so that the rules get applied to those pages as well
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        # product page links: scrape the actual item data
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # yield product item here
        pass

The engine starts sending requests to the URLs in start_urls and executing the default callback (the parse() method of CrawlSpider) for their responses.

For each response, the parse() method will execute the link extractors on it to get the links from the page. Namely, it calls LinkExtractor.extract_links(response) for each response object to get the URLs, and then yields scrapy.Request(url, <rule_callback>) objects.
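
Conceptually, that rule handling behaves roughly like the following function. This is a simplified sketch, not the actual CrawlSpider source, and it assumes string callbacks as in the example above:

import scrapy

def follow_rules(spider, response):
    # for every rule, extract the links its LinkExtractor finds in the response
    # and turn each one into a new request bound to that rule's callback
    for rule in spider.rules:
        for link in rule.link_extractor.extract_links(response):
            callback = getattr(spider, rule.callback) if isinstance(rule.callback, str) else rule.callback
            yield scrapy.Request(link.url, callback=callback)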

The example code is a skeleton for a spider that crawls an e-commerce site, following the links of product categories and subcategories to get links to each of the product pages.

For the rules registered specifically in this spider, it would crawl the links inside the "categories" and "subcategories" lists with the parse() method as callback (which triggers the crawl rules to be applied to those pages as well), and the links matching the regular expression /product\.php\?id=\d+ with the callback parse_product_page(), which would finally scrape the product data.
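
For completeness, parse_product_page() in the skeleton above could end up looking roughly like this (the XPath expressions and field names are hypothetical and depend entirely on the target site's markup):

    def parse_product_page(self, response):
        # selectors and field names below are made up for illustration only
        yield {
            'name': response.xpath("//h1[@class='product-name']/text()").get(),
            'price': response.xpath("//span[@class='price']/text()").get(),
            'url': response.url,
        }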

As you can see, pretty powerful stuff. =)
