Calling scrapy from a python script not creating JSON output file


Problem description

Here's the Python script I am using to call Scrapy, taken from the answer to *Scrapy crawl from script always blocks script execution after scraping*:

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider is assumed to be importable from the project's spider module

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
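One detail worth noting in this script: `Crawler(Settings())` builds a bare settings object, which does not read the project's settings.py. That means `ITEM_PIPELINES` is empty, the pipeline's `spider_opened` never fires, and no JSON file is ever opened. A minimal sketch of the override that would be needed (the project name and pipeline path below are assumptions for illustration):

```python
# Hypothetical project/pipeline path -- substitute your own.
# A bare Settings() object knows nothing about the project's settings.py,
# so the pipeline has to be registered explicitly when running from a script.
pipeline_settings = {
    'ITEM_PIPELINES': ['scrapermar11.pipelines.scrapermar11Pipeline'],
}
```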

Here's my pipelines.py code:

from scrapy import log,signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):


    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened , signals.spider_opened)
        dispatcher.connect(self.spider_closed , signals.spider_closed)


    def spider_opened(self,spider):
        file = open('links_pipelines.json' ,'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
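The exporter lifecycle matters here: `JsonItemExporter` writes a JSON array incrementally, so `finish_exporting()` (triggered by `spider_closed`) must run for the file to contain valid JSON. The sketch below is my own simplified stdlib stand-in for that behaviour, not Scrapy's actual implementation -- if the closed signal never fires, the closing bracket is never written and the output is unparseable:

```python
import io
import json

class MiniJsonExporter:
    """Simplified stand-in for JsonItemExporter's open/append/close cycle."""

    def __init__(self, file):
        self.file = file
        self.first = True

    def start_exporting(self):
        self.file.write('[')  # open the JSON array

    def export_item(self, item):
        if not self.first:
            self.file.write(',')  # separate items after the first
        self.first = False
        self.file.write(json.dumps(item))

    def finish_exporting(self):
        self.file.write(']')  # without this, the file is invalid JSON

# Demonstration with an in-memory buffer instead of a real file
buf = io.StringIO()
exporter = MiniJsonExporter(buf)
exporter.start_exporting()
exporter.export_item({'link': 'http://example.com/a'})
exporter.export_item({'link': 'http://example.com/b'})
exporter.finish_exporting()
items = json.loads(buf.getvalue())
```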

This code was taken from here:

Scrapy :: JSON 导出问题

When I run the crawler like this:

scrapy crawl MySpider -a start_url='abc'

a links file with the expected output is created. But when I execute the Python script it does not create any file, even though the crawler runs: the dumped Scrapy stats are similar to those of the previous run. I think there's a mistake in the Python script, since the file does get created in the first approach. How do I get the script to output the file?

Recommended answer

This code works for me:

from twisted.internet import reactor  # needed by handleSpiderIdle below
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="") # create a spider ourselves
crawlerProcess.crawl(spider) # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)

log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."

