Close a scrapy spider when a condition is met and return the output object


Question


I have made a spider with scrapy to get reviews from a page like this one. I only want product reviews up to a certain date (2nd July 2016 in this case). I want to close my spider as soon as a review date is earlier than the given date, and return the list of items. The spider is working well, but my problem is that I am not able to close it when the condition is met: if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually. Here is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import FlipkartProductReviewsItem
import re as r
import unicodedata
from datetime import datetime 

class Freviewspider(CrawlSpider):
    name = "frs"
    allowed_domains = ["flipkart.com"]
    def __init__(self, *args, **kwargs):
        super(Freviewspider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]


    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nav_bar_next_prev"]')), callback="parse_start_url", follow=True),
    )


    def parse_start_url(self, response):

        hxs = Selector(response)
        titles = hxs.xpath('//div[@class="fclear fk-review fk-position-relative line "]')

        items = []

        for i in titles:

            item = FlipkartProductReviewsItem()

            #x-paths:

            title_xpath = "div[2]/div[1]/strong/text()"
            review_xpath = "div[2]/p/span/text()"
            date_xpath = "div[1]/div[3]/text()"



            #field-values-extraction:

            item["date"] = (''.join(i.xpath(date_xpath).extract())).replace('\n ', '')
            item["title"] = (''.join(i.xpath(title_xpath).extract())).replace('\n ', '')

            review_list = i.xpath(review_xpath).extract()
            temp_list = []
            for element in review_list:
                temp_list.append(element.replace('\n ', '').replace('\n', ''))

            item["review"] = ' '.join(temp_list)

            xxx = datetime.strptime(item["date"], '%d %b %Y ')
            comp_date = datetime.strptime('02 Jul 2016 ', '%d %b %Y ')
            if xxx>comp_date:
                items.append(item)
            else:
                break

        return items

Answer


To force the spider to close, you can raise the CloseSpider exception, as described here in the scrapy docs. Just be sure to return/yield your items before you raise the exception.
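A minimal sketch of that pattern (assuming scrapy is installed; the cutoff date and the stop_or_keep helper are illustrative names, not part of the original spider — in the real code this logic would live inside parse_start_url, where yielding each item before raising CloseSpider ensures the items collected so far are processed before the spider shuts down):

```python
from datetime import datetime

try:
    from scrapy.exceptions import CloseSpider  # real exception when scrapy is installed
except ImportError:
    class CloseSpider(Exception):  # stand-in so the date logic runs standalone
        pass

CUTOFF = datetime(2016, 7, 2)

def reached_cutoff(date_str, cutoff=CUTOFF):
    """True when a review date string like '01 Jul 2016' is on or before the cutoff."""
    return datetime.strptime(date_str.strip(), '%d %b %Y') <= cutoff

def stop_or_keep(item):
    """Return the item to be yielded, or raise CloseSpider once the cutoff is reached.

    In the spider's callback this becomes:
        if reached_cutoff(item["date"]):
            raise CloseSpider('review date on or before 02 Jul 2016')
        yield item
    """
    if reached_cutoff(item['date']):
        raise CloseSpider('review date on or before 02 Jul 2016')
    return item
```

Because the callback is a generator, scrapy consumes every item yielded before the raise, so nothing collected up to that point is lost — unlike a bare exception, which aborts without returning the list.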

