HBase column family


Question



The HBase documentation advises against creating more than 2-3 column families, because HBase does not handle more than that very well. The reason given is compaction and flushing, and hence the I/O they cause. However, if all my columns are always populated (for every row), I think this reasoning matters less. So, given that my access to columns is completely random (I want to read any combination of columns), can I use a one-column-per-column-family layout, effectively trying to make the table purely columnar?

There are many blogs and wikis explaining this, but they seem to contradict one another and only add to the confusion. I can't digest the fact that HBase prefers a single column family; if that is so, what is the point of calling it a column store?

Solution

Currently (though this is expected to change), all of the column families for a region are flushed together. This is the primary reason people say "HBase doesn't do well with more than 2 or 3 column families". Consider two column families (CFs), each with one column. Column A:A stores whole web page texts. Column B:B stores the number of words in the page. Every time we flush A:A (which happens more often because A:A's data is far bigger), we also have to go through a whole separate file I/O juggling routine for column B:B, even though there is no need to: with B:B holding only numbers, it could go for months without being flushed.
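The effect can be seen in a toy model. The sketch below is not HBase code: `ToyRegion`, the 128-unit flush threshold, and the cell sizes are made up for illustration. It assumes only what the paragraph above states, namely that when a region flushes, every column family with buffered data writes its own file.

```python
from collections import defaultdict


class ToyRegion:
    """Toy model of a region: one in-memory buffer (memstore) per column family.

    Assumption: a flush is triggered by the region's total buffered size and
    writes one file for every family that has buffered data.
    """

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.memstore = defaultdict(int)         # family -> buffered size
        self.flushed_files = defaultdict(list)   # family -> sizes of files written

    def put(self, family, size):
        self.memstore[family] += size
        if sum(self.memstore.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # All families are flushed together, however little each one holds.
        for family, buffered in self.memstore.items():
            if buffered > 0:
                self.flushed_files[family].append(buffered)
        self.memstore.clear()


region = ToyRegion(flush_threshold=128)
for _ in range(100):
    region.put("A", 32)  # page text: large cell
    region.put("B", 1)   # word count: tiny cell

print("files for A:", len(region.flushed_files["A"]))
print("files for B:", len(region.flushed_files["B"]),
      "with sizes", sorted(set(region.flushed_files["B"])))
```

In this model family B ends up with exactly as many files as family A, each only a few units long, because A's volume alone decides when the region flushes.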

If you store A and B in the same column family (A:A and A:B), you will probably see vastly better flush I/O performance, and because many HBase reads are served from memory rather than from disk, you will probably find that read speeds are equivalent.
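A rough way to compare the two layouts is to count the files each one writes under the same workload. This is a minimal sketch under stated assumptions, not a model of HBase internals: the `simulate` function, the column names `text` and `count`, the cell sizes, and the flush threshold are all hypothetical.

```python
def simulate(column_to_family, writes, flush_threshold=128):
    """Return the number of files written per family in a toy flush model.

    Assumptions (not HBase internals): the region flushes when its total
    buffered size reaches flush_threshold, and each flush writes one file
    for every family that has buffered data.
    """
    buffered = {}  # family -> buffered size
    files = {}     # family -> number of files written
    for column, size in writes:
        family = column_to_family[column]
        buffered[family] = buffered.get(family, 0) + size
        if sum(buffered.values()) >= flush_threshold:
            for fam in buffered:
                files[fam] = files.get(fam, 0) + 1
            buffered = {}
    return files


# Hypothetical workload: each row writes a large page text and a tiny word count.
writes = [("text", 32), ("count", 1)] * 100

two_families = simulate({"text": "A", "count": "B"}, writes)
one_family = simulate({"text": "A", "count": "A"}, writes)

print("two families:", two_families, "total files:", sum(two_families.values()))
print("one family:  ", one_family, "total files:", sum(one_family.values()))
```

With these made-up numbers the single-family layout writes half as many files for the same data, since the word counts ride along in the files that the page text forces anyway.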

Also, and perhaps more importantly, if the cardinality of the columns is wildly different, then your regionservers will need to maintain useless mostly-empty files for your less-dense column families. This will never change.
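The cardinality point can be illustrated with simple counting. The sketch below assumes a table split into equally sized regions by row count and a sparse family populated in every 1,000th row; the function name and all numbers are hypothetical and chosen only to show the shape of the problem.

```python
def sparse_rows_per_region(total_rows, rows_per_region, sparse_every):
    """Count, for each region, the rows that hold a value in the sparse family.

    Assumptions: regions split evenly by row count, and the sparse family is
    populated in every sparse_every-th row.
    """
    counts = []
    for start in range(0, total_rows, rows_per_region):
        end = min(start + rows_per_region, total_rows)
        counts.append(sum(1 for row in range(start, end) if row % sparse_every == 0))
    return counts


counts = sparse_rows_per_region(
    total_rows=1_000_000, rows_per_region=100_000, sparse_every=1_000
)
print("regions:", len(counts))
print("sparse-family rows per region:", counts)
```

Every region still has to keep storage for the sparse family even though it holds a tiny fraction of that region's rows, and reading the whole sparse family means visiting every region.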

All of this is available in the HBase Book.

So, as in all such performance situations, measure before deciding what the "correct" path is.
