Large scale data processing: HBase vs Cassandra


Problem description

I have nearly settled on Cassandra after researching large scale data storage solutions. But it is generally said that HBase is the better solution for large scale data processing and analysis.

Both are key/value stores, and both can run under a Hadoop layer (Cassandra only recently), so what makes HBase a better candidate when processing/analysis is required on large data?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

But I'm still looking for concrete advantages of HBase.

I am more convinced by Cassandra because of its simplicity for adding nodes, its seamless replication and its lack of a single point of failure. It also has a secondary index feature, which is a good plus.

Solution

Which one is best for you really depends on what you are going to use it for. They each have their advantages, and without more details it becomes more of a religious war. The post you referenced is also more than a year old, and both have gone through many changes since then. Please also keep in mind that I am not familiar with the more recent Cassandra developments.

Having said that, I'll paraphrase HBase committer Andrew Purtell and add some of my own experiences:

  • HBase is in larger production environments (1000 nodes), although that is still in the ballpark of Cassandra's ~400 node installs, so it's really a marginal difference.

  • HBase and Cassandra both support replication between clusters/datacenters. I believe HBase exposes more of it to the user, so it appears more complicated, but you also get more flexibility.

  • If strong consistency is what your application needs, then HBase is likely a better fit. It is designed from the ground up to be consistent. For example, it allows for simpler implementation of atomic counters (I think Cassandra just got them) as well as check-and-put operations.

  • Write performance is great. From what I understand, that was one of the reasons Facebook went with HBase for their messenger.

  • I'm not sure of the current state of Cassandra's ordered partitioner, but in the past it required manual rebalancing. HBase handles that for you if you want. The ordered partitioner is important for Hadoop style processing.

  • Cassandra and HBase are both complex; Cassandra just hides it better. HBase exposes it more by using HDFS for its storage, but if you look at the codebase, Cassandra is just as layered. If you compare the Dynamo and Bigtable papers, you can see that Cassandra's theory of operation is actually more complex.

  • HBase has more unit tests FWIW.

  • All Cassandra RPC is Thrift, while HBase has Thrift, REST and native Java interfaces. The Thrift and REST interfaces only offer a subset of the total client API, but if you want pure speed the native Java client is there.

  • There are advantages to both peer-to-peer and master-slave designs. The master-slave setup generally makes it easier to debug and reduces quite a bit of complexity.

  • HBase is not tied to only traditional HDFS; you can change out your underlying storage depending on your needs. MapR looks quite interesting, and I have heard good things, although I have not used it myself.
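
To make the strong consistency point more concrete, here is a toy sketch of what "check and put" and an atomic counter mean semantically. It is not HBase's client API: `ToyTable`, `check_and_put`, `increment` and `hammer_counter` are hypothetical names, and a single in-memory lock stands in for whatever the real store does to serialize writes to a row.

```python
import threading


class ToyTable:
    """In-memory stand-in for a strongly consistent row store (not a real client)."""

    def __init__(self):
        self._rows = {}
        # Assumption: one lock models serialized access to a row.
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._rows.get(key)

    def check_and_put(self, key, expected, new_value):
        """Write new_value only if the current value equals expected."""
        with self._lock:
            if self._rows.get(key) != expected:
                return False  # someone else changed the row first
            self._rows[key] = new_value
            return True

    def increment(self, key, amount=1):
        """Atomically add amount to a counter and return the new value."""
        with self._lock:
            value = self._rows.get(key, 0) + amount
            self._rows[key] = value
            return value


def hammer_counter(table, key, threads=8, per_thread=1000):
    """Increment one counter from several threads and return the final value."""
    def worker():
        for _ in range(per_thread):
            table.increment(key)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return table.get(key)
```

Because the read and the write happen under the same lock, two writers that both expect an empty row cannot both succeed, and concurrent increments are never lost. In an eventually consistent store the same guarantees usually need extra machinery on top.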
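
On why an ordered partitioner matters for Hadoop style processing, a rough sketch of the idea follows. Under ordered (range) partitioning, a scan over a contiguous key range only touches the nodes that own that range, whereas hash partitioning spreads neighbouring keys across the cluster. The split points and function names here are made up for illustration and do not correspond to either system's configuration.

```python
import bisect
import hashlib


def ordered_node(key, split_points):
    """Node index under range partitioning.

    Node i owns keys in [split_points[i-1], split_points[i]).
    """
    return bisect.bisect_right(split_points, key)


def hashed_node(key, node_count):
    """Node index under hash partitioning (MD5 used only as an example hash)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def nodes_for_range(start, stop, split_points):
    """Nodes a range-partitioned layout must visit to scan keys in [start, stop)."""
    if stop <= start:
        return []
    first = ordered_node(start, split_points)
    # stop is exclusive, so a stop that equals a split point stays on the lower node
    last = bisect.bisect_left(split_points, stop)
    return list(range(first, last + 1))
```

With hypothetical split points `["g", "n", "t"]` (four nodes), `nodes_for_range("h", "m", ...)` returns a single node, so a MapReduce task over that key range can be scheduled next to its data. With `hashed_node`, there is no such relationship between adjacent keys and nodes, so a range scan generally has to ask every node.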
