Can I execute scrapy (python) crawl outside the project dir?


Question


The docs say I can only execute the crawl command inside the project dir:

scrapy crawl tutor -o items.json -t json

but I really need to execute it from my Python code (the Python file is not inside the project dir).

Is there any approach that fits my requirement?

My project tree:

.
├── etao
│   ├── etao
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       ├── etao_spider.py
│   ├── items.json
│   ├── scrapy.cfg
│   └── start.py
└── start.py    <-------------- I want to execute the script here.

And here's my code, which follows this link, but it doesn't work:

#!/usr/bin/env python
import os
#Must be at the top before other imports
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')

from scrapy import project
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess

class CrawlerScript():

  def __init__(self):
    self.crawler = CrawlerProcess(settings)
    if not hasattr(project, 'crawler'):
      self.crawler.install()
    self.crawler.configure()

  def crawl(self, spider_name):
    spider = self.crawler.spiders.create(spider_name)   # <--- line 19
    if spider:
      self.crawler.queue.append_spider(spider)
    self.crawler.start()
    self.crawler.stop()


# main
if __name__ == '__main__':
  crawler = CrawlerScript()
  crawler.crawl('etao')

The error is:

line 19: KeyError: 'Spider not found: etao'

Solution

You can actually call the CrawlerProcess yourself...

It's something like:

from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings


settings.overrides.update({}) # your settings

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

crawlerProcess.crawl(spider) # your spider here
crawlerProcess.start()       # without this the queued spider never runs
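The KeyError: 'Spider not found: etao' in the question usually means the project package is not importable from where the script runs: outside the project dir, neither etao.settings nor the spider modules are on sys.path. A minimal, pure-Python sketch of the setup step (the etao names and paths are taken from the tree in the question; prepare_project is a hypothetical helper, not a Scrapy API):

```python
import os
import sys

def prepare_project(project_dir, settings_module):
    # Put the directory that contains scrapy.cfg on sys.path so the
    # project package (etao.settings, etao.spiders.*) can be imported.
    sys.path.insert(0, os.path.abspath(project_dir))
    # Tell Scrapy which settings module to load.
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", settings_module)

# From the outer start.py in the tree above, the project dir is "etao".
prepare_project("etao", "etao.settings")
print(os.environ["SCRAPY_SETTINGS_MODULE"])
```

With the project importable, Scrapy's spider lookup can find the spider by name instead of raising KeyError.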

Credits to @warwaruk.

