Column based or row based for HBase


Question






Does HBase use column-based storage or row-based storage?

  • I read some technical documents that say one advantage of HBase is that it uses column-based storage to keep similar data together, which helps compression. That would mean the same column from different rows is stored together.
  • But I also learned that HBase is a sorted key-value map. It uses the key to address all the related columns for that key (row), so it looks like row-based storage?

I would appreciate it if anyone could clarify my confusion.

Thanks in advance, George

Answer

George, here's a presentation I gave about understanding HBase schemas from HBaseCon 2012:

http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html

In short, each row in HBase is actually a key/value map, where you can have any number of columns (keys), each of which has a value. (Technically, each column can hold multiple values with different timestamps.)
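As a rough mental model (a plain-Python sketch, not the HBase client API), the row described above can be pictured as nested maps: row key to column to timestamp to value. The names `table`, `put`, and `get_latest` are hypothetical and exist only for this illustration.

```python
# Toy model only: row key -> column name -> {timestamp: value}.
# Real HBase keeps rows sorted by key and persists cells on disk;
# this sketch ignores both.
table = {}


def put(row_key, column, value, timestamp):
    """Store one version of a cell."""
    table.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value


def get_latest(row_key, column):
    """Return the value with the highest timestamp, or None if absent."""
    versions = table.get(row_key, {}).get(column)
    if not versions:
        return None
    return versions[max(versions)]


# Two rows with different sets of columns: there is no fixed per-row schema.
put("user1", "name", "George", timestamp=1)
put("user1", "email", "old@example.com", timestamp=1)
put("user1", "email", "new@example.com", timestamp=2)
put("user2", "phone", "555-0100", timestamp=1)

print(get_latest("user1", "email"))   # newest version of the cell
print(sorted(table["user1"]))         # columns present in this row only
```

Everything for one row key lives under that key, which is why it is natural to think of HBase as a row store whose rows happen to be maps.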

Additionally, "column families" allow you to host multiple key/value maps in the same row, in different physical (on disk) files. This helps optimize in situations where you have sets of values that are usually accessed separately from other sets (so you have less stuff to read off disk). The trade-off, of course, is that it's more work to read all the values in a row if you separate columns into two column families, because twice as many disk accesses are needed.
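The trade-off can be sketched with a toy model in which each column family gets its own in-memory dict, standing in for that family's separate files on disk. `ToyTable`, its methods, and the `store_reads` counter are hypothetical names for this illustration, not part of HBase.

```python
class ToyTable:
    """Toy model: one dict per column family, standing in for that
    family's separate on-disk files."""

    def __init__(self, families):
        self.stores = {family: {} for family in families}
        self.store_reads = 0  # how many family stores a read touched

    def put(self, row_key, family, qualifier, value):
        self.stores[family].setdefault(row_key, {})[qualifier] = value

    def get(self, row_key, families=None):
        """Read a row, touching only the requested families (default: all)."""
        result = {}
        for family in (families or self.stores):
            self.store_reads += 1
            cells = self.stores[family].get(row_key, {})
            for qualifier, value in cells.items():
                result[f"{family}:{qualifier}"] = value
        return result


t = ToyTable(["profile", "activity"])
t.put("user1", "profile", "name", "George")
t.put("user1", "activity", "last_login", "2012-05-22")

t.get("user1", families=["profile"])  # only the "profile" store is read
reads_for_one_family = t.store_reads

t.store_reads = 0
whole_row = t.get("user1")            # both stores are read
reads_for_whole_row = t.store_reads

print(reads_for_one_family, reads_for_whole_row)
```

Reading a single family touches one store, while reading the whole row touches every store, which is the cost the answer describes.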

Unlike more standard "column oriented" databases, I've never heard of anyone creating an HBase table that had a column family for every logical column. There's overhead associated with column families, and the general advice is usually to have no more than 3 or 4 of them. Column families are "design time" information, meaning you must specify them at the time you create (or alter) the table.
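The "design time" point can be illustrated the same way: in this minimal sketch the set of families is fixed when the table object is created, while column names inside a family are free-form at write time. `ToySchemaTable` is a hypothetical name, and the `KeyError` is only a stand-in for whatever error a real system reports for an undeclared family.

```python
class ToySchemaTable:
    """Toy model: families are declared up front, columns are not."""

    def __init__(self, families):
        self.families = frozenset(families)  # fixed at creation time
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        # Writing to an undeclared family is an error in this sketch.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value


t2 = ToySchemaTable(["d"])

# Any column name is accepted inside a declared family.
t2.put("row1", "d", "color", "red")
t2.put("row1", "d", "a_column_nobody_planned_for", "ok")

# A family that was never declared is rejected.
try:
    t2.put("row1", "extra", "color", "blue")
    rejected = False
except KeyError:
    rejected = True

print(rejected)
```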

Generally, I find column families to be an advanced design option that you'd only use once you have a deep understanding of HBase's architecture and can show that it would be a net benefit.

So overall, while it's true that HBase can act in a "column oriented" way, that's neither the default nor the most common design pattern in HBase. It's better to think of it as a row store with key/value maps.
