Get Scrapy crawler output/results in script file function


Problem Description

I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in some function in that script file. I do not want to save the output/results in any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results?

Recommended Answer

Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.signalmanager import dispatcher
from scrapy.utils.project import get_project_settings

# adjust this import to wherever your spider class is defined
from my_project.spiders import MySpider


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        # fired once for every item the spider yields
        results.append(item)

    # item_scraped is the current name for the deprecated item_passed signal
    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())
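
One caveat with this approach: process.start() starts the Twisted reactor, which cannot be restarted, so spider_results() can only be called once per Python process. If you prefer to keep the CrawlerRunner setup from the question, the same signal trick works there as well. The following is a minimal sketch, not a tested drop-in: it reuses the spider name 'my_spider' from the question and hands the collected items to spider_output through a callback on the Deferred returned by runner.crawl.

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


def spider_output(output):
    # do something with the collected items
    print(output)


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())

results = []


def collect_item(item, response, spider):
    # fired once per scraped item; PyDispatcher only passes the
    # arguments that this signature declares
    results.append(item)


dispatcher.connect(collect_item, signal=signals.item_scraped)

d = runner.crawl('my_spider')
d.addCallback(lambda _: spider_output(results))  # runs once the crawl has finished
d.addBoth(lambda _: reactor.stop())
reactor.run()

Compared with CrawlerProcess, CrawlerRunner leaves reactor management to you, which is exactly what makes it possible to run spider_output before the reactor shuts down.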

