How to build a web crawler based on Scrapy to run forever?


Question

I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:

  1. Run forever, meaning it will periodically re-visit some portal pages to get updates.
  2. Give different priorities to different types of URLs.
  3. Fetch with multiple threads.

I've read the Scrapy documentation but haven't found anything related to what I listed (maybe I wasn't careful enough). Does anyone here know how to do this, or can you share some ideas/examples? Thanks!

Answer

Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box. You will probably have to become relatively familiar with the module for some tasks.

  1. Running forever is up to the application that calls Scrapy. You tell the spiders where to go and when to go there; one way to do this is sketched after this list.
  2. Giving priorities is the job of scheduler middleware, which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code; in principle the function is there. A request-level alternative is sketched below.
  3. Scrapy is inherently, fundamentally asynchronous, which may well be what you are after: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent bona fide multi-threading, but Scrapy doesn't provide threading services; the relevant concurrency settings are sketched below.
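To make point 1 concrete, here is a minimal sketch of a spider that never finishes, assuming a hypothetical portal URL and placeholder selectors. One common trick is to re-yield the portal page with dont_filter=True so Scrapy's duplicate-request filter doesn't drop the repeat visit:

```python
import scrapy

class NewsPortalSpider(scrapy.Spider):
    # Hypothetical spider name and portal URL -- replace with real targets.
    name = "news_portal"
    start_urls = ["https://example.com/news"]

    def parse(self, response):
        # Placeholder selector: collect every image URL on the page.
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}

        # Re-schedule the same portal page so the crawl never ends.
        # dont_filter=True bypasses Scrapy's duplicate-request filter.
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)
```

Combined with a DOWNLOAD_DELAY (see the settings sketch below), this makes the spider poll the portal indefinitely instead of hammering it in a tight loop.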
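For point 2, writing scheduler middleware is one route, but note that Scrapy's Request also accepts a priority argument: requests with a higher priority value are dequeued earlier by the scheduler. A sketch, where the URL pattern used to decide the priority is an assumption:

```python
import scrapy

class PrioritySpider(scrapy.Spider):
    name = "priority_demo"
    start_urls = ["https://example.com/news"]  # hypothetical portal

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Assumed convention: section index pages matter more than
            # individual articles, so they get a higher scheduler priority.
            prio = 10 if "/section/" in url else 0
            yield scrapy.Request(url, callback=self.parse, priority=prio)
```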
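For point 3, concurrency in Scrapy is configured rather than threaded: the Twisted-based engine keeps many requests in flight within a single thread. A sketch of the relevant settings, with illustrative values:

```python
# settings.py -- illustrative values; tune them per target site
CONCURRENT_REQUESTS = 32             # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap, stay polite
DOWNLOAD_DELAY = 0.5                 # seconds between requests to one domain
```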

Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to do.
