How to automate Amazon AWS EC2 for scraping

Question

Hi, I'd like to set up multiple Amazon EC2 instances to scrape data from arbitrary sites. The way I imagine it, one EC2 instance acts as a master that programmatically sets up other instances to do the scraping. Right now I have PHP scripts that scrape the way I want, but how can I set up my master server to...

1) make other EC2 instances

2) communicate between the master server and slave servers

Recommended answer

You could build this yourself: have your master launch worker instances when needed, send them scrape requests, and terminate them when they are no longer needed, coding all of the orchestration yourself and trying to make it highly available. But that's not a good way to do this. Instead, you should take advantage of AWS features.

You could use a combination of SQS and Auto Scaling Groups. Your master instance would add scrape requests to an SQS queue, and an Auto Scaling Group triggered on SQS queue depth would launch new worker instances. This automates launching workers (scrapers) when the workload is high and terminating them when the workload is low. Each worker instance would pull a scrape request from the SQS queue, do the scraping work, and then repeat.
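
As a rough illustration of the queue-based flow described above, here is a minimal Python sketch using the SQS calls `send_message`, `receive_message` and `delete_message` as exposed by a boto3 SQS client. The function names, the JSON message shape (`{"url": ...}`) and the `scrape` callback are assumptions made for this example, not part of the original answer, and the Auto Scaling Group itself would still have to be configured separately.

```python
import json


def enqueue_scrape_request(sqs, queue_url, target_url):
    """Master side: put one scrape request on the queue.

    The {"url": ...} message format is an assumption for this sketch.
    """
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"url": target_url}),
    )


def process_batch(sqs, queue_url, scrape):
    """Worker side: receive up to 10 requests, scrape each, then delete it.

    Returns the number of requests handled in this batch.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS allows at most 10 messages per call
        WaitTimeSeconds=20,      # long polling instead of busy-waiting
    )
    handled = 0
    for message in response.get("Messages", []):
        request = json.loads(message["Body"])
        scrape(request["url"])
        # Delete only after scrape() returned without raising; an undeleted
        # message becomes visible again after the visibility timeout.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled


def run_worker(queue_url, scrape):
    """Hypothetical worker entry point: loop forever, pulling requests."""
    import boto3  # assumed to be installed and configured with credentials

    sqs = boto3.client("sqs")
    while True:
        process_batch(sqs, queue_url, scrape)
```

On a worker instance, something like `run_worker(queue_url, my_scrape_function)` could be started at boot. Passing the client in as a parameter keeps the master and worker logic easy to exercise without touching AWS.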

Another way to do this would be to use AWS Lambda. You can trigger Lambda functions from SQS or from SNS. Have your master add scrape requests to an SQS queue, or publish them to an SNS topic, and then drive a web-scraper Lambda function (written in JavaScript) from that queue or topic.
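
The answer suggests JavaScript for the Lambda function; the sketch below uses Python instead, purely as an illustration of the same idea. It assumes each message carries a JSON body of the form `{"url": ...}` (an assumption of this example) and reads it from either an SQS-triggered event (`record["body"]`) or an SNS-triggered event (`record["Sns"]["Message"]`). The `fetch` helper and the extra `fetch` parameter on the handler are hypothetical conveniences for testing, not something Lambda requires.

```python
import json
import urllib.request


def extract_requests(event):
    """Yield scrape requests from an SQS- or SNS-triggered Lambda event."""
    for record in event.get("Records", []):
        if "Sns" in record:
            payload = record["Sns"]["Message"]  # SNS delivery
        else:
            payload = record["body"]            # SQS delivery
        yield json.loads(payload)


def fetch(url, timeout=10):
    """Download the raw page bytes; real scraping logic would go here."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def lambda_handler(event, context, fetch=fetch):
    """Lambda entry point: scrape every URL contained in the event."""
    count = 0
    total_bytes = 0
    for request in extract_requests(event):
        body = fetch(request["url"])
        total_bytes += len(body)
        count += 1
    return {"scraped": count, "bytes": total_bytes}
```

With an SQS trigger, Lambda removes the messages from the queue once the handler returns without raising, so the function does not need to delete them itself.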

Personally, I would investigate the Lambda route first.
