How to build a web crawler based on Scrapy to run forever?
Question
I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:
Run forever
Means it will periodically re-visit some portal pages to get updates.
Schedule priorities.
Give different priorities to different types of URLs.
Multi-thread fetch
I've read the Scrapy documentation but have not found anything related to what I listed (maybe I was not careful enough). Does anyone here know how to do that? Or could you give some ideas/examples? Thanks!
Answer
Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box; you will probably have to become relatively familiar with the module for some tasks.
- Running forever is up to the application that calls Scrapy. You tell the spiders where to go and when to go there (see the sketch after this list).
- Giving priorities is the job of Scheduler middleware, which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code - in principle the function is there.
- Scrapy is inherently, fundamentally asynchronous, which may well be what you are after: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent you from bona fide multi-threading, but Scrapy doesn't provide threading services.
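For illustration only, here is a minimal sketch of how the first two points can be handled with plain Scrapy features: Request objects accept a priority argument (higher values are scheduled earlier) and dont_filter=True to bypass the duplicate filter, which is one way to keep re-visiting a portal page indefinitely without writing custom Scheduler middleware. The spider name, URL, CSS selectors, and priority values below are all placeholders, not from the answer.

```python
import scrapy


class NewsPortalSpider(scrapy.Spider):
    """Sketch: re-queue portal front pages forever, and give them a
    higher scheduling priority than individual article pages."""
    name = "news_portal"
    start_urls = ["https://example-portal.com/"]  # hypothetical portal

    custom_settings = {
        "DOWNLOAD_DELAY": 5,  # throttle so the endless re-visits stay polite
    }

    def parse(self, response):
        # Follow article links at the default priority.
        for href in response.css("a.article::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article, priority=0)

        # Re-queue the portal page itself with a higher priority;
        # dont_filter=True bypasses Scrapy's duplicate filter so the
        # same URL can be fetched again, which keeps the crawl alive.
        yield scrapy.Request(response.url, callback=self.parse,
                             priority=10, dont_filter=True)

    def parse_article(self, response):
        # Extract image URLs; the selector is a placeholder.
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}
```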
Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to do.
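As one example of the "application that calls Scrapy", the sketch spider above could be driven by a small launcher script. CrawlerProcess is Scrapy's standard entry point for running spiders from plain Python; the settings shown are illustrative.

```python
from scrapy.crawler import CrawlerProcess

# NewsPortalSpider is the hypothetical sketch spider defined above.
process = CrawlerProcess(settings={
    "CONCURRENT_REQUESTS": 16,  # concurrency comes from Scrapy's async engine, not OS threads
})
process.crawl(NewsPortalSpider)
process.start()  # blocks; with the self-re-queueing spider this runs until interrupted
```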