Scrapy crawl from script always blocks script execution after scraping


Problem Description

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from my script. Here is part of my script:

    from twisted.internet import reactor

    from scrapy import log
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # execution never gets past this call
    print "It can't be printed out!"

It works as it should: it visits pages, scrapes the needed info and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see it by the numbers in the output JSON), execution of my script doesn't resume. It probably isn't a Scrapy problem, and the answer should be somewhere in Twisted's reactor. How can I release the thread execution?

Recommended Answer

You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher

from testspiders.spiders.followall import FollowAllSpider

def stop_reactor():
    reactor.stop()

# shut down the reactor as soon as the spider signals that it has closed
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

And the command line log output might look something like:

stav@maia:/srv/scrapy/testspiders$ ./api
2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 23934,...}
2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
stav@maia:/srv/scrapy/testspiders$
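
For reference: newer Scrapy releases (1.0 and later) removed scrapy.xlib.pydispatch along with the old Crawler and log APIs used above, but the idea is the same: run the reactor yourself and stop it when the crawl ends. Here is a minimal sketch of the modern equivalent, assuming Scrapy 1.0+ and the same testspiders project, which stops the reactor from the deferred that CrawlerRunner.crawl() returns:

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from testspiders.spiders.followall import FollowAllSpider

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(FollowAllSpider, domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())  # stop the reactor on success or failure
reactor.run()  # blocks here until the crawl is finished
print('Reactor stopped, script execution resumes.')

If you would rather not manage the reactor at all, CrawlerProcess starts and stops it for you, so the script simply continues after process.start() returns.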
