Can Scrapy be replaced by pyspider?

Question

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed and popular.

pyspider's home page lists several things supported out of the box:

  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • JavaScript pages supported!
  • Task priority, retry, periodical crawling, and recrawling by age or by marks in the index page (such as update time)
  • Distributed architecture
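
To give a concrete picture of the features just listed, here is a minimal pyspider handler sketch in the style of the project's quick-start template; the class name and URL are placeholders, not code from the question or answer. @every drives periodic crawling, age and priority control recrawling and task ordering, and fetch_type='js' asks the fetcher to render JavaScript pages; the script itself is edited and monitored in the WebUI.

```python
# Minimal pyspider handler sketch; the class name and URLs are placeholders.
from pyspider.libs.base_handler import *


class FeatureDemoHandler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)            # periodical: re-run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)     # recrawl by age: a page counts as fresh for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page,
                       fetch_type='js')  # render JavaScript pages

    @config(priority=2)                # task priority
    def detail_page(self, response):
        return {'url': response.url, 'title': response.doc('title').text()}
```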

These are things that Scrapy itself doesn't provide, but they are possible with the help of portia (for the web UI), scrapyjs (for JS pages) and scrapyd (deploying and distributing through an API).
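
For comparison, a bare-bones Scrapy spider looks roughly like the sketch below; the spider name and target site are illustrative. Everything beyond this core (web UI, JS rendering, distributed deployment) comes from the separate portia, scrapyjs and scrapyd projects mentioned above.

```python
# A bare-bones Scrapy spider sketch; the name and target site are illustrative.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```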

Is it true that pyspider alone can replace all of these tools? In other words, is pyspider a direct alternative to Scrapy? If not, then which use cases does it cover?

I hope I am not crossing the "too broad" or "opinion-based" line.

Accepted answer

pyspider and Scrapy have the same purpose, web scraping, but take a different view of how to do it.

  • The spider should never stop until the WWW dies. (Information changes and data gets updated on websites; the spider should have the ability and the responsibility to scrape the latest data. That's why pyspider has a URL database, a powerful scheduler, @every, age, etc.)
  • pyspider is a service more than a framework. (Components run in isolated processes, the lite all-in-one version runs as a service too, you need a browser rather than a Python environment, everything about fetching or scheduling is controlled by the script via an API rather than startup parameters or global configs, resources/projects are managed by pyspider, etc.)
  • pyspider is a spider system. (Any component can be replaced, even re-implemented in C/C++/Java or any other language, for better performance or larger capacity.)

  • on_start vs start_urls
  • token-bucket traffic control vs download_delay
  • return JSON vs class Item
  • message queue vs Pipeline
  • built-in URL database vs set
  • persistence vs in-memory
  • PyQuery + any third-party package you like vs built-in CSS/XPath support
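
To make two of these bullets concrete ("return JSON vs class Item" and "message queue vs Pipeline"), here is a rough side-by-side sketch; the class names are illustrative, not code from either project's documentation.

```python
# Illustrative only: how results flow in each framework.
import scrapy
from pyspider.libs.base_handler import BaseHandler


# Scrapy: results are modelled as an Item class and post-processed by a pipeline.
class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()


class SavePipeline(object):
    def process_item(self, item, spider):
        # e.g. write the item to a database here
        return item


# pyspider: a callback just returns a plain JSON-serialisable dict; pyspider hands
# it to its result worker over a message queue, and on_result() can be overridden
# in the handler to process results yourself.
class PageHandler(BaseHandler):
    def detail_page(self, response):
        return {'url': response.url, 'title': response.doc('title').text()}
```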

In fact, I haven't borrowed much from Scrapy; pyspider is really quite different from Scrapy.

But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing anything.
