Scrapy - Understanding CrawlSpider and LinkExtractor


Question

So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # Return a plain dict as the item; a bare scrapy.Item() has no declared
        # fields, so assigning to it would raise a KeyError.
        return {
            'id': response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)'),
            'name': response.xpath('//td[@id="item_name"]/text()').extract(),
            'description': response.xpath('//td[@id="item_description"]/text()').extract(),
        }

The description given is then:

This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I understand that for the second rule, it extracts links from item.php and then extracts the information using the parse_item method. However, what exactly is the purpose of the first rule? It just says that it "collects" the links. What does that mean and why is it useful if they are not extracting any data from it?

Answer

CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.

The idea is that "somehow" you have to go into each category, looking for links that correspond to the product/item information you want to extract. Those product links are the ones specified in the second rule of that example (the ones that have item.php in the URL).
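
For reference, the extractor from a rule can also be exercised on its own, which makes it easier to see what each rule matches. A minimal sketch (assuming response is the response object of an already-fetched page; none of this is part of the original example):

from scrapy.linkextractors import LinkExtractor

# The same pattern as in the second rule: match URLs containing item.php.
extractor = LinkExtractor(allow=(r'item\.php',))

# extract_links() returns the scrapy.link.Link objects found in the page.
for link in extractor.extract_links(response):
    print(link.url)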

Now, how does the spider keep visiting links until it finds the ones containing item.php? That's what the first rule is for. It says to visit every link containing category.php but not subsection.php, which means it won't extract any "item" from those links; instead, it defines the path the spider takes to reach the real items.
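
To make that path concrete, here is a hypothetical crawl sequence (the specific URLs are invented for illustration; example.com has no such pages):

# http://www.example.com/                   -> start URL; both rules are applied to its links
# http://www.example.com/category.php?id=1  -> matches the first rule: followed, no callback
# http://www.example.com/category.php?id=2  -> matches the first rule: followed, no callback
# http://www.example.com/item.php?id=42     -> matches the second rule: parse_item is called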

That's why the first rule doesn't include a callback: the responses for those links are not returned for you to process; they are simply followed. (As the code comment notes, Rule's follow argument defaults to True when no callback is given, and to False otherwise.)
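
If you also wanted to scrape data from the category pages themselves, you could give the first rule a callback of its own, setting follow=True explicitly since supplying a callback flips the default to False. A sketch (parse_category is a hypothetical method name, not part of the original example):

Rule(
    LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',)),
    callback='parse_category',  # hypothetical: parse the category page itself
    follow=True,                # keep following category links as before
),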
