Close a scrapy spider when a condition is met and return the output object
Question
I have made a spider with scrapy to get reviews from a page like this one. I only want product reviews up to a certain date (2nd July 2016 in this case). As soon as a review's date is earlier than the given date, I want to close the spider and return the items list. The spider works well, but my problem is that I am not able to close it when the condition is met: if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually. Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import FlipkartProductReviewsItem
from datetime import datetime


class Freviewspider(CrawlSpider):
    name = "frs"
    allowed_domains = ["flipkart.com"]

    def __init__(self, *args, **kwargs):
        super(Freviewspider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nav_bar_next_prev"]')),
             callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//div[@class="fclear fk-review fk-position-relative line "]')
        items = []
        for i in titles:
            item = FlipkartProductReviewsItem()
            # XPaths:
            title_xpath = "div[2]/div[1]/strong/text()"
            review_xpath = "div[2]/p/span/text()"
            date_xpath = "div[1]/div[3]/text()"
            # Field extraction:
            item["date"] = (''.join(i.xpath(date_xpath).extract())).replace('\n ', '')
            item["title"] = (''.join(i.xpath(title_xpath).extract())).replace('\n ', '')
            review_list = i.xpath(review_xpath).extract()
            temp_list = []
            for element in review_list:
                temp_list.append(element.replace('\n ', '').replace('\n', ''))
            item["review"] = ' '.join(temp_list)
            xxx = datetime.strptime(item["date"], '%d %b %Y ')
            comp_date = datetime.strptime('02 Jul 2016 ', '%d %b %Y ')
            if xxx > comp_date:
                items.append(item)
            else:
                break
        return items
Answer
To force the spider to close you can raise a CloseSpider exception, as described here in the scrapy docs. Just be sure to return/yield your items before raising the exception.
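The yield-then-raise pattern can be sketched as below. In a real spider you would import the exception with from scrapy.exceptions import CloseSpider and raise it inside the parse callback; here CloseSpider is a stand-in class so the sketch runs without scrapy installed, and the rows are hypothetical sample data:

```python
from datetime import datetime


class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider (same idea: carries a reason)."""
    def __init__(self, reason="cancelled"):
        self.reason = reason


CUTOFF = datetime(2016, 7, 2)


def parse_reviews(rows):
    """Yield reviews newer than CUTOFF; stop the crawl at the first older one."""
    for row in rows:
        review_date = datetime.strptime(row["date"], "%d %b %Y")
        if review_date <= CUTOFF:
            # Items already yielded have been handed to the engine/pipelines;
            # raising CloseSpider then shuts the spider down gracefully.
            raise CloseSpider(reason="reached cutoff date")
        yield row


rows = [
    {"date": "05 Jul 2016", "title": "Great"},
    {"date": "03 Jul 2016", "title": "Good"},
    {"date": "01 Jul 2016", "title": "Old"},   # earlier than cutoff, triggers CloseSpider
]

collected = []
reason = None
try:
    for item in parse_reviews(rows):
        collected.append(item)
except CloseSpider as exc:
    reason = exc.reason

print(len(collected), reason)
```

The key difference from the original code is that items are yielded one by one as they are built instead of being accumulated into a list and returned at the end; that way everything produced before the cutoff is already delivered when the exception stops the crawl.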