Best solution to host a crawler?


Problem description

I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. To get through all of it, I need the crawler to be running 24/7.

Currently I host the crawler script on the same server as the site the crawler adds the content to, and I can only run it as a cronjob at night, because when it runs, the load from the script basically brings the website down. In other words, a pretty crappy solution.

So basically I wonder what my best option is for this kind of setup:

  • Is it possible to keep running the crawler on the same host, but somehow balance the load so that the script doesn't kill the website?

  • What kind of host/server would I be looking for to host a crawler? Are there any specifications I need beyond a normal web host?

  • The crawler saves the images it crawls. If I host the crawler on a secondary server, how do I save the images on my site's server? I guess I don't want CHMOD 777 on my uploads folder, allowing anyone to put files on my server.

Recommended answer

I decided to go with Amazon Web Services to host my crawler, since it offers both SQS for queues and auto-scalable instances. It also has S3, where I can store all my images.
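This also answers the image question from above: instead of writing crawled images into a world-writable uploads folder on the web server, the crawler pushes them straight to an S3 bucket. Below is a minimal sketch of that idea using boto3; the bucket name, key layout, and helper function are hypothetical placeholders, not the original code.

```python
import boto3
import requests

# Hypothetical bucket name -- substitute your own.
S3_BUCKET = "my-crawler-images"

s3 = boto3.client("s3")

def store_image(image_url: str, post_id: str) -> str:
    """Download a crawled image and upload it to S3 instead of the web server."""
    resp = requests.get(image_url, timeout=10)
    resp.raise_for_status()

    key = f"images/{post_id}/{image_url.rsplit('/', 1)[-1]}"
    s3.put_object(
        Bucket=S3_BUCKET,
        Key=key,
        Body=resp.content,
        ContentType=resp.headers.get("Content-Type", "application/octet-stream"),
    )
    return key
```

The site can then serve the images from S3 (directly or through a CDN), so no CHMOD 777 uploads folder is needed on the web server at all.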

I also decided to rewrite the whole crawler in Python instead of PHP, to take advantage of things such as queues more easily and to keep the app running 100% of the time instead of relying on cronjobs.

So what I did, and what it means:

  1. I set up an Elastic Beanstalk application for my crawler, configured as a "Worker" and listening to an SQS queue where I store all the domains that need to be crawled. SQS is a queue where I can save each domain that needs to be crawled; the crawler listens to the queue and fetches one domain at a time until the queue is empty. There is no need for cronjobs or anything like that: as soon as data lands in the queue, it is sent to the crawler. This means the crawler is up 100% of the time, 24/7 (a minimal sketch of the queue-consumption pattern follows below).
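To make the queue semantics concrete, here is a minimal, hypothetical sketch of that consumption pattern with boto3: long-poll the queue, crawl the domain named in the message, and delete the message only after a successful crawl. The queue URL and the crawl_domain function are placeholders, not the original code.

```python
import boto3

# Hypothetical queue URL -- substitute your own SQS queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"

sqs = boto3.client("sqs")

def crawl_domain(domain: str) -> None:
    """Placeholder for the actual crawling logic."""
    print(f"crawling {domain} ...")

def run_worker() -> None:
    """Poll the queue forever: one message = one domain to crawl."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling: wait up to 20s for a message
        )
        for msg in resp.get("Messages", []):
            crawl_domain(msg["Body"])
            # Delete only after a successful crawl, so failed messages reappear.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    run_worker()
```

Note that in an Elastic Beanstalk worker environment the platform's own daemon polls the queue and hands each message to your application over a local HTTP request, so in practice the loop above becomes a small HTTP handler; the sketch is only meant to illustrate how the queue drives the crawler.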

The application is set to auto scale, meaning that when there are too many domains in the queue, it spins up a second, third, fourth, etc. instance/crawler to speed up the process. I think this is a very, very important point for anyone who wants to set up a crawler.
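The other half of the pattern is feeding the queue. A minimal, hypothetical producer that enqueues domains (for example from a scheduler or from the site itself) could look like this, using the same placeholder queue URL as above:

```python
import boto3

# Same hypothetical queue as in the worker sketch above.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"

sqs = boto3.client("sqs")

def enqueue_domains(domains) -> None:
    """Push one message per domain; the worker fleet drains the queue."""
    for domain in domains:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=domain)

enqueue_domains(["example.com", "example.org", "example.net"])
```

Because every pending domain is a visible message, the queue depth (e.g. the ApproximateNumberOfMessagesVisible CloudWatch metric) is a natural signal for scaling the number of crawler instances up and down.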

The results have been great. When I had a PHP crawler running on a cronjob every 15 minutes, I could crawl about 600 URLs per hour. Now I can crawl 10,000+ URLs per hour without problems, and even more depending on how I configure the auto scaling.

