Scrapy - Understanding CrawlSpider and LinkExtractor


Question

So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # Return a plain dict as the item; a bare scrapy.Item() has no declared
        # fields, so assigning to it would raise a KeyError.
        return {
            'id': response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)'),
            'name': response.xpath('//td[@id="item_name"]/text()').extract(),
            'description': response.xpath('//td[@id="item_description"]/text()').extract(),
        }

The description given is then:

This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I understand that for the second rule, it extracts links from item.php and then extracts the information using the parse_item method. However, what exactly is the purpose of the first rule? It just says that it "collects" the links. What does that mean and why is it useful if they are not extracting any data from it?

Answer

CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.

The idea is that "somehow" you have to go into each category, looking for links that correspond to the product/item information you want to extract. Those product links are the ones specified in the second rule of that example (the ones that have item.php in the URL).
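
For reference, the extractor from a rule can also be exercised on its own, which makes it easier to see what each rule matches. A minimal sketch (assuming response is the response object of an already-fetched page; none of this is part of the original example):

from scrapy.linkextractors import LinkExtractor

# The same pattern as in the second rule: match URLs containing item.php.
extractor = LinkExtractor(allow=(r'item\.php',))

# extract_links() returns the scrapy.link.Link objects found in the page.
for link in extractor.extract_links(response):
    print(link.url)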

Now, how does the spider keep visiting links until it finds the ones containing item.php? That's what the first rule is for. It says to visit every link containing category.php but not subsection.php, which means it won't extract any "item" from those links; instead, it defines the path the spider takes to reach the real items.
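
To make that path concrete, here is a hypothetical crawl sequence (the specific URLs are invented for illustration; example.com has no such pages):

# http://www.example.com/                   -> start URL; both rules are applied to its links
# http://www.example.com/category.php?id=1  -> matches the first rule: followed, no callback
# http://www.example.com/category.php?id=2  -> matches the first rule: followed, no callback
# http://www.example.com/item.php?id=42     -> matches the second rule: parse_item is called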

That's why the first rule doesn't include a callback: the responses for those links are not returned for you to process; they are simply followed. (As the code comment notes, Rule's follow argument defaults to True when no callback is given, and to False otherwise.)
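
If you also wanted to scrape data from the category pages themselves, you could give the first rule a callback of its own, setting follow=True explicitly since supplying a callback flips the default to False. A sketch (parse_category is a hypothetical method name, not part of the original example):

Rule(
    LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',)),
    callback='parse_category',  # hypothetical: parse the category page itself
    follow=True,                # keep following category links as before
),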
