Multiple inheritance in scrapy spiders

Problem description

Is it possible to create a spider which inherits the functionality from two base spiders, namely SitemapSpider and CrawlSpider?

I have been trying to scrape data from various sites and realized that not all sites list every one of their pages in the sitemap, hence the need for CrawlSpider. But CrawlSpider goes through a lot of junk pages and is overkill on its own.
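For context, the sitemap-only half would look something like this. This is a minimal sketch, not the real project code: the spider name, the '/p/' rule, and the placeholder extraction are assumptions. It shows how SitemapSpider reads sitemap locations from robots.txt and routes matching URLs to a callback, and why it can only ever see pages the sitemap actually lists:

from scrapy.spiders import SitemapSpider

class ProductSitemapSpider(SitemapSpider):
    name = "products_sitemap"  # hypothetical spider name
    sitemap_urls = ['http://www.flipkart.com/robots.txt']
    # Each rule is (regex-or-substring, callback name); '/p/' is an
    # assumed marker for product detail URLs.
    sitemap_rules = [('/p/', 'parse_product')]

    def parse_product(self, response):
        # Placeholder extraction; real selectors depend on the page markup.
        yield {'url': response.url, 'title': response.css('title::text').get()}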

What I want to do is something like this:

  1. Start my spider, which is a subclass of SitemapSpider, and pass regex-matched responses to the parse_product method to extract useful information (see the regex check after this list).

  2. Go to links matching the regex /reviews/ from the product page, and send those responses to the parse_review function.
     Note: "/reviews/"-type pages are not listed in the sitemap.

  3. Extract information from the /reviews/ page.
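To make the regex routing in step 1 concrete, here is a quick standalone check (plain re, with a hypothetical product URL) of the kind of pattern the sitemap rule uses:

import re

# Hypothetical product URL; the pattern mirrors the sitemap rule below.
url = 'http://www.flipkart.com/some-product/p/itmexample'
print(bool(re.search(r'/(.*?)/p/(.*?)', url)))  # True -> routed to parse_product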

CrawlSpider is basically for recursive crawls and scraping.
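The crawl-only half, again as a minimal sketch with assumed names and patterns rather than the real project code: rules tell the spider which links to follow from every fetched page, which is how it reaches pages a sitemap omits, and also why it touches so many junk pages:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ReviewCrawlSpider(CrawlSpider):
    name = "reviews_crawl"  # hypothetical spider name
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/']
    rules = [
        # Pages matching /product-reviews/ are parsed by the callback below.
        Rule(LinkExtractor(allow=[r'/product-reviews/']), callback='parse_review'),
        # Every other link is simply followed so the crawl keeps expanding.
        Rule(LinkExtractor(), follow=True),
    ]

    def parse_review(self, response):
        # Placeholder extraction.
        yield {'url': response.url}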

-------ADDITIONAL DETAILS-------

The site in question is www.flipkart.com. The site has listings for a lot of products, with each product having its own detail page. Along with the details page, there is a corresponding "review" page for the product. The link to the review page is also available on the product details page.

Note: Review pages are not listed in the sitemap.

from scrapy.spiders import SitemapSpider, CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WebCrawler(SitemapSpider, CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    # SitemapSpider part: discover product pages via the sitemaps in robots.txt.
    sitemap_urls = ['http://www.flipkart.com/robots.txt']
    sitemap_rules = [('/(.*?)/p/(.*?)', 'parse_product')]
    # CrawlSpider part: follow review links that the sitemap does not list.
    start_urls = ['http://www.flipkart.com/']
    rules = [Rule(LinkExtractor(allow=['/(.*?)/product-reviews/(.*?)']), 'parse_reviews'),
             Rule(LinkExtractor(restrict_xpaths='//div[@class="fk-navigation fk-text-center tmargin10"]'), follow=True)]

    def parse_product(self, response):
        # FlipkartItemLoader is a project-specific ItemLoader subclass.
        loader = FlipkartItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('name', 'xpath to name')
        yield loader.load_item()

    def parse_reviews(self, response):
        # ReviewItemLoader is a project-specific ItemLoader subclass.
        loader = ReviewItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('review_title', 'xpath to review title')
        loader.add_xpath('review_text', 'xpath to review text')
        yield loader.load_item()

Answer

You are on the right track. The only thing left is, at the end of your parse_product function, to yield all the URLs extracted by the crawler. CrawlSpider applies its link-extraction rules inside its built-in parse() method, and your sitemap callbacks bypass that method, so the review links are never followed unless you hand the response back to it, like so:

def parse_product(self, response):
    loader = FlipkartItemLoader(response=response)
    loader.add_value('pid', 'value of pid')
    loader.add_xpath('name', 'xpath to name')
    yield loader.load_item()

    # CrawlSpider's parse() applies the rules to this response and
    # yields Requests for all the extracted links.
    yield from self.parse(response)

If you don't have the yield from syntax (it requires Python 3.3+), then just use:

for req in self.parse(response):
    yield req
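Why does self.parse run the CrawlSpider rules here, given that the class inherits from two base spiders? SitemapSpider does not define parse(), so Python's method resolution order falls through to CrawlSpider.parse(), which is the rule-driven callback. A quick sanity check, as a sketch:

from scrapy.spiders import SitemapSpider, CrawlSpider

class WebCrawler(SitemapSpider, CrawlSpider):
    pass

# SitemapSpider comes first in the MRO, but only CrawlSpider defines
# parse(), so self.parse resolves to CrawlSpider.parse.
print([cls.__name__ for cls in WebCrawler.__mro__])
# ['WebCrawler', 'SitemapSpider', 'CrawlSpider', 'Spider', ...]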
