Running dozens of Scrapy spiders in a controlled manner


Question

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().
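
For reference, the approach those questions (and the Scrapy docs) point at looks roughly like the minimal sketch below. It assumes it is run from inside the Scrapy project, so the project settings and spider loader can find everything; it is only meant to illustrate what "set up a CrawlerProcess, add the spiders to it, and hit start()" means here:

# Single-process approach from the Scrapy docs: one CrawlerProcess hosting every spider.
from scrapy.crawler import CrawlerProcess
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
spider_loader = SpiderLoader.from_settings(settings)

process = CrawlerProcess(settings)
for name in spider_loader.list():   # every spider registered in the project
    process.crawl(name)             # accepts a spider name or a Spider class
process.start()                     # blocks until all crawls have finished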

When I tried this method with all 325 of my spiders, though, it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it. I've tried a few things that haven't worked.

What is the recommended way to run a large number of spiders with Scrapy?

Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
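
To make that concrete, the kind of thing I have in mind is a small driver script along these lines: a fixed-size pool that runs each spider as its own scrapy crawl subprocess and notifies me at the end. This is only a rough sketch; the pool size and the notification step are placeholders:

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # placeholder: tune to the hardware

def spider_names():
    # "scrapy list" prints one spider name per line
    out = subprocess.run(["scrapy", "list"], capture_output=True, text=True, check=True)
    return out.stdout.split()

def run_spider(name):
    # each crawl gets its own OS process, so one crash can't wedge the rest
    return name, subprocess.run(["scrapy", "crawl", name]).returncode

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        results = list(pool.map(run_spider, spider_names()))
    failed = [name for name, code in results if code != 0]
    print("all crawls finished; failed:", failed or "none")  # send the real notification here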

Answer

The simplest way to do this is to run them all from the command line. For example:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

This will run all of your spiders, with up to 4 running in parallel at any time. You can then send a notification in a script once the command has completed.

A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
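
For example, once the spiders are deployed to scrapyd, a small driver script can schedule them all and poll until the queue drains. The sketch below assumes a default scrapyd instance on localhost:6800 and a deployed project called "myproject" (both placeholders); the concurrency cap itself is configured on the scrapyd side via its max_proc / max_proc_per_cpu settings:

import time
import requests

SCRAPYD = "http://localhost:6800"   # assumed default scrapyd address
PROJECT = "myproject"               # placeholder project name

def schedule(spiders):
    for name in spiders:
        r = requests.post(f"{SCRAPYD}/schedule.json",
                          data={"project": PROJECT, "spider": name})
        r.raise_for_status()

def wait_until_done(poll_seconds=30):
    # poll listjobs.json until nothing is pending or running
    while True:
        jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                            params={"project": PROJECT}).json()
        if not jobs["pending"] and not jobs["running"]:
            return jobs["finished"]
        time.sleep(poll_seconds)

if __name__ == "__main__":
    schedule(["spider_one", "spider_two"])   # placeholder spider names
    finished = wait_until_done()
    print(f"{len(finished)} crawls finished")  # send the notification here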

Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It will run only a certain number of jobs at once, and it has a queue of pending jobs (which you can modify, browse online, prioritize, etc.) and a more complete API than scrapyd.

You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.
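
If it does come to that, the soft limit can usually be raised without extra privileges as long as it stays at or below the hard limit. A minimal sketch (POSIX-only; the shell equivalent is ulimit -n):

import resource

# inspect the current per-process file descriptor limits
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file descriptor limits: soft={soft}, hard={hard}")

# raise the soft limit up to the hard limit; going beyond the hard limit needs root
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))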
