How to run a Scrapy spider programmatically, like a simple script?
Problem description
I created a Scrapy spider, but I want to run it as a script. How can I do that? Right now I can run it with this command in the terminal:
$ scrapy crawl book -o book.json
But I want to run it like a simple Python script.
Recommended answer
You can run a spider directly in a Python script, without creating a project. You have to use scrapy.crawler.CrawlerProcess or scrapy.crawler.CrawlerRunner, but I'm not sure whether they offer all the functionality of a full project. See more in the documentation: Common Practices. A working example follows below.
Or you can put your command in a bash script on Linux or in a .bat file on Windows. BTW: on Linux you can add a shebang as the first line (#!/bin/bash) and set the executable attribute (i.e. chmod +x your_script), and the script will run like a normal program; see the sketch right after this paragraph.
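For example, a minimal wrapper around the command from the question (the filename run_book.sh is just an illustration):

#!/bin/bash
# run the spider named "book" and export the scraped items to book.json
scrapy crawl book -o book.json

Make it executable once with chmod +x run_book.sh; after that, ./run_book.sh starts the crawl.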
Working example
#!/usr/bin/env python3

import scrapy


class MySpider(scrapy.Spider):

    name = 'myspider'

    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['toscrape.com']

    start_urls = ['http://quotes.toscrape.com']

    # alternatively, generate the requests yourself:
    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)


# --- it runs without a project and saves the items in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # on Scrapy >= 2.1 the single FEEDS setting replaces these two keys
    'FEED_FORMAT': 'csv',     # csv, json or xml
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
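Save this as, e.g., myspider.py and run it with python3 myspider.py (or chmod +x myspider.py followed by ./myspider.py, thanks to the shebang); the scraped items land in output.csv.

If your script already runs inside Twisted's reactor, CrawlerRunner is the alternative mentioned above. A minimal sketch following the Common Practices page, reusing the MySpider class and the same settings dict from the example:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# CrawlerRunner configures neither logging nor shutdown handling itself
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

runner = CrawlerRunner({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})

# MySpider is the class defined in the example above
d = runner.crawl(MySpider)           # crawl() returns a Deferred
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl ends
reactor.run()                        # blocks until reactor.stop() is called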