Can Scrapy be replaced by pyspider?
Question
I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed, and popular.
pyspider's home page lists several things supported out-of-the-box:
Powerful WebUI with script editor, task monitor, project manager and result viewer
Javascript pages supported!
Task priority, retry, periodical and recrawl by age or marks in index page (like update time)
Distributed architecture
These are things that Scrapy itself doesn't provide, but they are possible with the help of portia (for the Web UI), scrapyjs (for JS pages), and scrapyd (deploying and distributing through an API).
Is it true that pyspider alone can replace all of these tools? In other words, is pyspider a direct alternative to Scrapy? If not, then which use cases does it cover?
I hope I'm not crossing into "too broad" or "opinion-based" territory.
Answer
pyspider and Scrapy have the same purpose, web scraping, but take different views on how to do it.
A spider should not stop until the WWW dies. (Information changes and data gets updated on websites; a spider should have the ability and the responsibility to crawl the latest data. That's why pyspider has a URL database, a powerful scheduler, @every, age, etc.)
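The age-based recrawl idea can be sketched in plain Python. The names and data layout here are illustrative, not pyspider's actual internals; pyspider's `age` option works on the same principle of treating a stored result as stale once it is older than a threshold:

```python
import time

# Toy URL database: maps a URL to the time it was last crawled.
# pyspider keeps something similar so its scheduler can decide,
# per task, whether the stored result is still fresh.
url_db = {}

def needs_crawl(url, age, now=None):
    """Return True if `url` was never crawled, or its last crawl is
    older than `age` seconds (illustrative, not pyspider code)."""
    now = time.time() if now is None else now
    last = url_db.get(url)
    return last is None or (now - last) >= age

def mark_crawled(url, now=None):
    """Record the crawl time for `url` in the toy database."""
    url_db[url] = time.time() if now is None else now
```

With an `age` of one hour, a URL crawled at t=1000 is skipped at t=2000 but eligible again at t=5000, which is exactly the "keep recrawling, but not too often" behavior the answer describes.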
pyspider is more a service than a framework. (Components run in isolated processes; the lite all-in-one version runs as a service too; you don't need a Python environment, just a browser; everything about fetching or scheduling is controlled by the script via an API rather than startup parameters or global configs; resources/projects are managed by pyspider; etc.)
pyspider is a spider system. (Any component can be replaced, even reimplemented in C/C++/Java or any other language, for better performance or larger capacity.)
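That component split can be sketched with plain queues: a fetcher and a processor share nothing but messages, so either side could be swapped out (even for a non-Python implementation) as long as it speaks the same message format. This is an illustrative sketch, not pyspider's real wire protocol:

```python
import queue

# Two queues stand in for the message broker between components.
fetch_queue = queue.Queue()    # scheduler -> fetcher
result_queue = queue.Queue()   # fetcher -> processor

def fetcher(task):
    """Pretend fetcher: a real one would download task['url'].
    Because it only reads and writes queue messages, it could be
    replaced by a C++ or Java process speaking the same format."""
    return {'url': task['url'], 'content': '<html>stub</html>'}

# Scheduler side: enqueue a task.
fetch_queue.put({'url': 'http://example.com/'})

# Fetcher side: drain the queue and publish results.
while not fetch_queue.empty():
    result_queue.put(fetcher(fetch_queue.get()))

# Processor side: consume a result.
result = result_queue.get()
```

The design point is that the queue is the only contract; processes can run, crash, and restart independently, which is what makes the system feel like a service rather than one monolithic framework process.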
And:

- on_start vs start_url
- token bucket traffic control vs download_delay
- return json vs class Item
- message queue vs Pipeline
- built-in URL database vs set
- persistence vs in-memory
- PyQuery + any third-party package you like vs built-in CSS/XPath support
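The token-bucket point in the list above is about smoothing: instead of a fixed download_delay spacing every pair of requests equally, a token bucket allows short bursts while capping the average rate. A minimal sketch, not pyspider's actual implementation:

```python
import time

class TokenBucket:
    """Minimal token bucket: permits bursts of up to `burst`
    requests, refilling at `rate` tokens per second. A fixed
    download_delay, by contrast, enforces the same gap between
    every two consecutive requests."""

    def __init__(self, rate, burst):
        self.rate = rate                # tokens added per second
        self.burst = burst              # maximum bucket size
        self.tokens = float(burst)      # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; otherwise deny the request."""
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A crawler loop would call `allow()` before each fetch and sleep briefly when it returns False; the first `burst` requests go out immediately, and the long-run rate settles at `rate` requests per second.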
In fact, I haven't borrowed much from Scrapy; pyspider is really different from Scrapy.
But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing anything.