Nutch 2.1 (HBase, SOLR) with Amazon Web Services


Problem description

I have used Nutch 2.1 locally without any difficulty, and I have also tried it on a three-machine distributed cluster. We are now discussing whether to run it on Amazon Web Services. I do not have much experience with AWS. My question is: is it possible, and is it necessary, to run the crawling and indexing parts of Nutch 2.1 in the cloud? What advantages and disadvantages should we expect?

Thanks.

Recommended answer

If you already have a cluster with the same capacity as the AWS cluster you plan to pay for, then AWS offers no advantage except for #1 below.

Here are several factors you should think about before switching to AWS:


  1. Locality of crawled hosts: Suppose you are in Europe and the websites you want to crawl are hosted far away, say in Australia. Crawling from AWS nodes located in Australia would be much faster than crawling from Europe.

  2. Cost: AWS machines are billed by the hour. Can you afford that? If not, you are better off using your own machines.

  3. Current cluster capacity: Does your current cluster have ample capacity and disk space to handle the crawled data? Can it accommodate all the data fetched by the crawler? I don't think computational speed will be a problem, since Nutch runs on Hadoop, which was designed to run on commodity hardware.

  4. Volume of data: What is a rough estimate of the amount of data to be crawled? If it is small, then an AWS cluster makes no sense.

  5. Time constraints: Is there a deadline for completing the crawl?

If you are doing this for a professional project, then these factors must be given some thought.
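The data-volume and time-constraint questions can be roughed out with a few lines of arithmetic. The sketch below is only an illustration: the function name `estimate_crawl` and every input value (page count, average page size, per-node fetch rate, deadline) are hypothetical placeholders, not measurements from Nutch or AWS. Substitute numbers from your own local or three-machine runs.

```python
import math


def estimate_crawl(pages, avg_page_kb, pages_per_sec_per_node, deadline_hours):
    """Rough sizing of a crawl; all inputs are assumptions supplied by the caller.

    Returns (total_gb, nodes_needed):
      total_gb     - raw fetched data volume in GiB
      nodes_needed - fetcher nodes required to finish before the deadline
    """
    # Raw volume: pages * average size, converted from KiB to GiB.
    total_gb = pages * avg_page_kb / (1024 * 1024)
    # Pages a single node can fetch before the deadline at the assumed rate.
    pages_per_node = pages_per_sec_per_node * 3600 * deadline_hours
    nodes_needed = math.ceil(pages / pages_per_node)
    return total_gb, nodes_needed


# Hypothetical inputs, for illustration only.
total_gb, nodes = estimate_crawl(
    pages=10_000_000,
    avg_page_kb=100,
    pages_per_sec_per_node=20,
    deadline_hours=48,
)
print(f"about {total_gb:.0f} GB of raw data, {nodes} node(s) to meet the deadline")
```

If the resulting volume and node count fit comfortably in the cluster you already own, that supports staying off AWS; if not, the same numbers feed directly into a cost estimate.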

If you are doing it for fun, as a hobby, or to learn, go ahead and use AWS free-tier nodes. These are low-capacity nodes that Amazon provides for free. It's fun to learn new things :)

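Advantages of AWS:


  1. No need to buy machines to set up a cluster. You can get started without any hardware beyond a terminal PC.

  2. Locality

  3. No need to look after the machines. If a node crashes badly, leave it (not your problem :P). Buy a new one, add it to the cluster, and carry on.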

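Disadvantages of AWS:


  1. It is costly.

  2. Copying data to any machine outside the AWS cluster is charged.

  3. Your data will not persist once you give up the AWS nodes you procured. If you want to keep it, you have to pay for and use the S3 storage service.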
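The three cost items above (hourly nodes, outbound data transfer, and S3 storage) can be added up in a small sketch. Everything here is an assumption made for illustration: the function name `estimate_aws_cost` is hypothetical, and the rates are made-up placeholders, not actual AWS prices. Look up current pricing for your region before relying on any figure.

```python
def estimate_aws_cost(nodes, hours, usd_per_node_hour,
                      egress_gb, usd_per_egress_gb,
                      stored_gb, months, usd_per_gb_month):
    """Sum the three cost items from the list above.

    All rates are caller-supplied assumptions, not real AWS prices.
    """
    compute = nodes * hours * usd_per_node_hour     # hourly node billing
    egress = egress_gb * usd_per_egress_gb          # copying data out of AWS
    storage = stored_gb * months * usd_per_gb_month  # keeping data in S3
    return {
        "compute": compute,
        "egress": egress,
        "storage": storage,
        "total": compute + egress + storage,
    }


# Placeholder rates and volumes, for illustration only.
cost = estimate_aws_cost(
    nodes=3, hours=48, usd_per_node_hour=0.50,
    egress_gb=100, usd_per_egress_gb=0.10,
    stored_gb=1000, months=2, usd_per_gb_month=0.05,
)
for item, usd in cost.items():
    print(f"{item}: ${usd:.2f}")
```

Comparing the total against the cost of running the same crawl on machines you already own gives a concrete basis for the decision.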
