Force spider to stop in Scrapy

Problem description

I have 20 spiders in one project. Each spider has a different task and URL to crawl (but the data are similar, and I use a shared items.py and pipelines.py for all of them). In my pipeline class I want a given spider to stop crawling when certain conditions are met. I have tested

raise DropItem("terminated by me")

raise CloseSpider('terminate by me')

but both of them only stop the current run of the spider, and the next_page URL is still being crawled!

My pipelines.py:

import pymongo

from scrapy import log                      # legacy Scrapy logging API used below
from scrapy.conf import settings            # legacy settings access used below
from scrapy.exceptions import CloseSpider, DropItem


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Both raises were added only to test stopping the spider;
        # everything below them is unreachable.
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")

        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
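
By way of comparison, a commonly used way to stop a spider from inside a pipeline is to ask the running engine to close it, rather than raising CloseSpider there (Scrapy only honours CloseSpider when it is raised from a spider callback). The sketch below is illustrative, not the original code: it assumes the spider was built through the default from_crawler path so that spider.crawler is available, and the item counter and threshold are made up.

from scrapy.exceptions import DropItem


class StopFromPipeline(object):
    """Hypothetical pipeline that closes the spider once a condition is met."""

    def __init__(self):
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        if self.count >= 100:  # made-up stop condition for illustration
            # Ask the engine to shut this spider down; no new requests
            # will be scheduled after the close starts.
            spider.crawler.engine.close_spider(spider, reason='condition met')
            raise DropItem('spider is shutting down')
        return item

Note that the shutdown is asynchronous, so a few responses that are already in flight may still reach the pipeline before the spider actually closes.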

And my spider:

import scrapy
import json
from Maio.items import MaioItem


class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)

    def start_requests(self):
        # The first request carries no last_post_date yet.
        next_pdate = None
        request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }

        request_body = json.dumps(request_body)
        print(request_body)
        yield scrapy.Request(
            url='https://search.Maio.io/json/',
            method="POST",
            body=request_body,
            callback=self.parse,
            headers={'Content-type': 'application/json;charset=UTF-8'}
        )

    def parse(self, response):
        print(response)
        result = json.loads(response.body.decode('utf-8'))
        next_pdate = result["result"]["last_post_date"]
        print(result["result"]["last_post_date"])
        for item in result["result"]["post_list"]:
            print("title : {0}".format(item["title"]))
            ads = MaioItem()
            ads['title'] = item["title"]
            ads['desc'] = item["desc"]
            yield ads
        if next_pdate:
            # Paginate: request the next page using the last post date seen.
            request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
            }

            request_body = json.dumps(request_body)
            yield scrapy.Request(
                url='https://search.Maio.io/json/',
                method="POST",
                body=request_body,
                callback=self.parse,
                headers={'Content-type': 'application/json; charset=UTF-8'}
            )
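
Because the pagination request is yielded from parse, the symptom of the next_page URL still being crawled can also be tackled by guarding that yield with a flag that the pipeline flips on the spider instance. This is a rough sketch under assumptions of mine: the stop_crawling attribute, the FlagPipeline name and the "empty desc" condition are illustrative, not part of the original project.

import scrapy
from scrapy.exceptions import DropItem


class FlagPipeline(object):
    """Hypothetical pipeline that flips a flag on the spider when a condition is met."""

    def process_item(self, item, spider):
        if not item.get('desc'):            # made-up stop condition for illustration
            spider.stop_crawling = True     # tell the spider not to paginate further
            raise DropItem('condition met, stopping pagination')
        return item


class FlaggedSpider(scrapy.Spider):
    """Hypothetical spider that only schedules the next page while the flag is unset."""
    name = 'flagged'
    start_urls = ['https://example.com/page/1']
    stop_crawling = False

    def parse(self, response):
        yield {'url': response.url, 'desc': response.css('title::text').get()}
        next_page = response.css('a.next::attr(href)').get()
        if next_page and not self.stop_crawling:
            yield response.follow(next_page, callback=self.parse)

Because items flow through the pipeline asynchronously, a page or two may still be fetched after the flag is set; the flag only prevents further pagination from being scheduled.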

**Update**

Even when I put sys.exit("SHUT DOWN EVERYTHING!") in the pipeline, the next page still runs.

I see the following log on every page that is crawled:

sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!

Recommended answer

OK, then you can use the CloseSpider exception.

from scrapy.exceptions import CloseSpider
# condition
raise CloseSpider("message")
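
For this to work, CloseSpider needs to be raised from a spider callback such as parse, not from process_item in the pipeline. Here is a minimal sketch of how it could be applied to the spider above, assuming an illustrative stop condition (an empty post_list) that is not taken from the original code:

import json

import scrapy
from scrapy.exceptions import CloseSpider


class TehranStopSpider(scrapy.Spider):
    """Hypothetical variant that stops itself from inside parse()."""
    name = 'tehran_stop'
    start_urls = ['https://search.Maio.io/json/']

    def parse(self, response):
        result = json.loads(response.body.decode('utf-8'))
        posts = result.get('result', {}).get('post_list', [])
        if not posts:  # assumed stop condition for illustration
            # Raised inside a callback, CloseSpider stops the crawl cleanly,
            # so no further requests (including the next page) are scheduled.
            raise CloseSpider('no more posts')
        for post in posts:
            yield {'title': post.get('title'), 'desc': post.get('desc')}

If the stopping decision really has to live in the pipeline, setting a flag on the spider or calling spider.crawler.engine.close_spider, as sketched earlier, is the usual workaround.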
