Force spider to stop in Scrapy
Problem description
I have 20 spiders in one project; each spider has a different task and URL to crawl (but the data are similar, and I'm using a shared items.py and pipelines.py for all of them). In my pipeline class I want the specified spider to stop crawling when certain conditions are satisfied. I've tested
raise DropItem("terminated by me")
and
raise CloseSpider('terminate by me')
but both of them only stop the current run of the spider, and the next_page URL is still crawled!
My pipelines.py:
import pymongo
from scrapy.conf import settings          # legacy-style settings access assumed by this snippet
from scrapy.exceptions import CloseSpider, DropItem
from scrapy import log                    # legacy logging module used below


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Both of these were tried here to stop the whole crawl:
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")

        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
And my spider:
import scrapy
import json

from Maio.items import MaioItem


class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)

    def start_requests(self):
        request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            # NOTE: next_pdate is not defined at this point in the posted code
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }
        # for body in request_body:
        #     request_body = body
        request_body = json.dumps(request_body)
        print(request_body)
        yield scrapy.Request(
            url='https://search.Maio.io/json/',
            method="POST",
            body=request_body,
            callback=self.parse,
            headers={'Content-type': 'application/json;charset=UTF-8'}
        )

    def parse(self, response):
        print(response)
        # print(response.body.decode('utf-8'))
        input = response.body.decode('utf-8')
        result = json.loads(input)
        # for key, item in result["result"]:
        #     print(key)
        next_pdate = result["result"]["last_post_date"]
        print(result["result"]["last_post_date"])
        for item in result["result"]["post_list"]:
            print("title : {0}".format(item["title"]))
            ads = MaioItem()
            ads['title'] = item["title"]
            ads['desc'] = item["desc"]
            yield ads
        if next_pdate:
            request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
            }
            request_body = json.dumps(request_body)
            yield scrapy.Request(
                url='https://search.Maio.io/json/',
                method="POST",
                body=request_body,
                callback=self.parse,
                headers={'Content-type': 'application/json; charset=UTF-8'}
            )
**Update**
Even if I put sys.exit("SHUT DOWN EVERYTHING!") in the pipeline, the next page still runs.
I see the following log on every page that is crawled:
sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!
Recommended answer
OK, then you can use the CloseSpider exception:
from scrapy.exceptions import CloseSpider
# condition
raise CloseSpider("message")