Multiple inheritance in scrapy spiders

Question

Is it possible to create a spider which inherits the functionality from two base spiders, namely SitemapSpider and CrawlSpider?

I have been trying to scrape data from various sites and realized that not all sites have a listing of every page on the website, hence the need to use CrawlSpider. But CrawlSpider goes through a lot of junk pages and is kind of overkill.
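
For reference, a sitemap-only spider needs little more than sitemap_urls and sitemap_rules; the sketch below is not from the question (the example.com URL, the ProductSitemapSpider name and the parse_page callback are placeholders, assuming a sitemap whose product URLs contain /p/):

from scrapy.spiders import SitemapSpider

class ProductSitemapSpider(SitemapSpider):
    name = "sitemap_only"
    # placeholder sitemap URL; SitemapSpider also accepts a robots.txt URL
    # and follows the Sitemap: entries listed in it
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # each (regex, callback) pair routes matching sitemap URLs to a callback
    sitemap_rules = [('/p/', 'parse_page')]

    def parse_page(self, response):
        # hypothetical callback: only URLs matching '/p/' end up here
        self.logger.info("Product page: %s", response.url)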

What I want to do is something like this:

  1. Start my spider, which is a subclass of SitemapSpider, and pass regex-matched responses to the parse_products method to extract useful information.

  2. Go to links matching the regex /reviews/ from the product page, and send the data to the parse_review function.
     Note: "/reviews/" type pages are not listed in the sitemap.

  3. Extract information from the /reviews/ page.

CrawlSpider is basically for recursive crawls and scraping.
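
As a quick illustration of that recursive pattern, a bare CrawlSpider only follows links matched by its rules; this is a minimal sketch with placeholder names (example.com, RecursiveSpider, parse_item are not part of the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):
    name = "crawl_only"
    allowed_domains = ['example.com']          # placeholder domain
    start_urls = ['http://www.example.com/']

    rules = [
        # send pages whose URL matches /reviews/ to parse_item, and keep following links
        Rule(LinkExtractor(allow=[r'/reviews/']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # hypothetical callback for matched pages
        self.logger.info("Matched page: %s", response.url)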

-------ADDITIONAL DETAILS-------

The site in question is www.flipkart.com. The site has listings for a lot of products, with each product having its own details page. Along with the details page, there is a corresponding "review" page for the product. The link to the review page is also available on the product details page.

Note: The review pages are not listed in the sitemap.

from scrapy.spiders import SitemapSpider, CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# FlipkartItemLoader and ReviewItemLoader are the project's custom item loaders (defined elsewhere)

class WebCrawler(SitemapSpider, CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    # pointing sitemap_urls at robots.txt makes SitemapSpider follow its Sitemap: entries
    sitemap_urls = ['http://www.flipkart.com/robots.txt']
    # sitemap URLs matching this pattern are handed to parse_product
    sitemap_rules = [('/(.*?)/p/(.*?)', 'parse_product')]
    start_urls = ['http://www.flipkart.com/']
    # CrawlSpider rules: review pages go to parse_reviews, navigation links are only followed
    rules = [Rule(LinkExtractor(allow=['/(.*?)/product-reviews/(.*?)']), 'parse_reviews'),
             Rule(LinkExtractor(restrict_xpaths='//div[@class="fk-navigation fk-text-center tmargin10"]'), follow=True)]

    def parse_product(self, response):
        loader = FlipkartItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('name', 'xpath to name')
        yield loader.load_item()

    def parse_reviews(self, response):
        loader = ReviewItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('review_title', 'xpath to review title')
        loader.add_xpath('review_text', 'xpath to review text')
        yield loader.load_item()

Answer

You are on the right track. The only thing left is that at the end of your parse_product function, you have to yield all the URLs extracted by the crawler, like so:

def parse_product(self, response):
    loader = FlipkartItemLoader(response=response)
    loader.add_value('pid', 'value of pid')
    loader.add_xpath('name', 'xpath to name')
    yield loader.load_item()

    # CrawlSpider's parse() applies the crawl rules to this response and yields the extracted requests
    yield from self.parse(response)

If you don't have the yield from syntax, then just use:

for req in self.parse(response):
    yield req
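
A note on why this works (based on how CrawlSpider is documented): CrawlSpider implements parse() itself and uses it to apply the crawl rules to a response, so calling self.parse(response) from parse_product hands the product page to the rule-based link extraction, which is what picks up the /product-reviews/ links and routes them to parse_reviews. For the same reason, avoid naming your own callbacks parse when using CrawlSpider. With that change in place the combined spider runs like any other, e.g.:

scrapy crawl flipkart -o reviews.json

(reviews.json is just an example output file for the feed exporter.)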
