Close a scrapy spider when a condition is met and return the output object


Question


I have made a spider with scrapy to get reviews from a page like this one. I only want product reviews up to a certain date (2nd July 2016 in this case). I want to close my spider as soon as a review date is earlier than the given date, and return the list of items. The spider is working well, but my problem is that I am not able to close it when the condition is met: if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually. Here is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import FlipkartProductReviewsItem
import re as r
import unicodedata
from datetime import datetime 

class Freviewspider(CrawlSpider):
    name = "frs"
    allowed_domains = ["flipkart.com"]
    def __init__(self, *args, **kwargs):
        super(Freviewspider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]


    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nav_bar_next_prev"]')), callback="parse_start_url", follow=True),
    )


    def parse_start_url(self, response):

        hxs = Selector(response)
        titles = hxs.xpath('//div[@class="fclear fk-review fk-position-relative line "]')

        items = []

        for i in titles:

            item = FlipkartProductReviewsItem()

            #x-paths:

            title_xpath = "div[2]/div[1]/strong/text()"
            review_xpath = "div[2]/p/span/text()"
            date_xpath = "div[1]/div[3]/text()"



            #field-values-extraction:

            item["date"] = (''.join(i.xpath(date_xpath).extract())).replace('\n ', '')
            item["title"] = (''.join(i.xpath(title_xpath).extract())).replace('\n ', '')

            review_list = i.xpath(review_xpath).extract()
            temp_list = []
            for element in review_list:
                temp_list.append(element.replace('\n ', '').replace('\n', ''))

            item["review"] = ' '.join(temp_list)

            xxx = datetime.strptime(item["date"], '%d %b %Y ')
            comp_date = datetime.strptime('02 Jul 2016 ', '%d %b %Y ')
            if xxx>comp_date:
                items.append(item)
            else:
                break

        return items

Answer


To force the spider to close, you can raise the CloseSpider exception, as described here in the scrapy docs. Just be sure to return/yield your items before you raise the exception.
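A minimal sketch of that pattern (assuming scrapy is installed; the cutoff date and the stop_or_keep helper are illustrative names, not part of the original spider — in the real code this logic would live inside parse_start_url, where yielding each item before raising CloseSpider ensures the items collected so far are processed before the spider shuts down):

```python
from datetime import datetime

try:
    from scrapy.exceptions import CloseSpider  # real exception when scrapy is installed
except ImportError:
    class CloseSpider(Exception):  # stand-in so the date logic runs standalone
        pass

CUTOFF = datetime(2016, 7, 2)

def reached_cutoff(date_str, cutoff=CUTOFF):
    """True when a review date string like '01 Jul 2016' is on or before the cutoff."""
    return datetime.strptime(date_str.strip(), '%d %b %Y') <= cutoff

def stop_or_keep(item):
    """Return the item to be yielded, or raise CloseSpider once the cutoff is reached.

    In the spider's callback this becomes:
        if reached_cutoff(item["date"]):
            raise CloseSpider('review date on or before 02 Jul 2016')
        yield item
    """
    if reached_cutoff(item['date']):
        raise CloseSpider('review date on or before 02 Jul 2016')
    return item
```

Because the callback is a generator, scrapy consumes every item yielded before the raise, so nothing collected up to that point is lost — unlike a bare exception, which aborts without returning the list.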

