How to run Scrapy from within a Python script

Views: 57 Posted: 2021/6/25 19:32:13 Tags: python web-scraping web-crawler scrapy

Problem description

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
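        p.start()
        # Read the result before joining: a child that has put data on a
        # Queue may not exit until that data is consumed, so join() first can hang.
        items = queue.get(True)
        p.join()
        return items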

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print(items)

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thanks.
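
For readers on a current Scrapy release: the snippet above targets an early Scrapy API and Python 2, but its core idea still applies. Each crawl runs in its own process because Twisted's reactor cannot be restarted. The following is only a minimal sketch of that idea, not the original author's code. It assumes Python 3 and, for run_spider only, Scrapy 1.0 or later. The names run_in_fresh_process, _call_and_report and run_spider are made up for this illustration, and MySpider in the usage comment stands for a spider class of your own.

```python
import multiprocessing


def _call_and_report(queue, job, args):
    """Child-side wrapper: send back (True, result) or (False, error text)."""
    try:
        queue.put((True, job(*args)))
    except Exception as exc:
        queue.put((False, repr(exc)))


def run_in_fresh_process(job, *args, ctx=None):
    """Run job(*args) in a new child process and return its result.

    Process-wide state created by the job (for Scrapy, the running Twisted
    reactor) disappears when the child exits, so the job can be repeated.
    The result must be picklable. This sketch does not handle a child that
    crashes before reporting anything.
    """
    if ctx is None:
        ctx = multiprocessing.get_context()
    queue = ctx.Queue()
    child = ctx.Process(target=_call_and_report, args=(queue, job, args))
    child.start()
    # Fetch the result before join(): a child that has put data on a queue
    # may not exit until that data has been consumed.
    ok, payload = queue.get()
    child.join()
    if not ok:
        raise RuntimeError(f"job failed in child process: {payload}")
    return payload


def run_spider(spider_cls, settings=None):
    """Crawl with one spider class and return the scraped items as a list.

    Assumption: Scrapy >= 1.0 is installed. It is imported inside the
    function so the rest of this module loads without Scrapy.
    """
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    items = []

    def collect(item, **kwargs):
        # Called once for every item the spider yields.
        items.append(item)

    process = CrawlerProcess(settings=settings or {})
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collect, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl has finished
    return items


# Usage (MySpider is a hypothetical spider class from your own project):
#
# if __name__ == "__main__":
#     first = run_in_fresh_process(run_spider, MySpider)
#     again = run_in_fresh_process(run_spider, MySpider)  # second run is fine
```

In this layout the spider class can live in any importable module, and the script's main block only has to call run_in_fresh_process(run_spider, MySpider) once per crawl.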
