Get Scrapy crawler output/results in script file function
Question
I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in that script file, in some function. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something to that output
    pass
How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results?
Answer
Here is a solution that collects all the output/results in a list:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher

# import your spider class here, e.g.:
# from my_project.spiders import MySpider

def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    # item_passed is an old alias of the item_scraped signal,
    # fired once for every item the spider yields
    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results

if __name__ == '__main__':
    print(spider_results())