Scrapy crawl from script always blocks script execution after scraping
Question
I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from a script. Here is part of my script:
crawler = Crawler(Settings(settings))
crawler.configure()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
print "It can't be printed out!"
It works as it should: it visits pages, scrapes the needed info, and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see this from the item count in the output JSON), execution of my script doesn't resume. It probably isn't a Scrapy problem; the answer should be somewhere in Twisted's reactor. How can I release thread execution?
Answer
You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
from testspiders.spiders.followall import FollowAllSpider
def stop_reactor():
reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run() # the script will block here until the spider is closed
log.msg('Reactor stopped.')
And the command line log output might look something like:
stav@maia:/srv/scrapy/testspiders$ ./api
2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 23934,...}
2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
stav@maia:/srv/scrapy/testspiders$