HBase row key design for monotonically increasing keys
I have an HBase table where I'm writing row keys like:
<prefix>~1
<prefix>~2
<prefix>~3
...
<prefix>~9
<prefix>~10
A scan in the HBase shell returns them in this order:
<prefix>~1
<prefix>~10
<prefix>~2
<prefix>~3
...
<prefix>~9
How should a row key be designed so that the row with key <prefix>~10 comes last? I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
How should a row key be designed so that the row with key ~10 comes last?
You see the scan output in this order because HBase keeps rowkeys sorted lexicographically, irrespective of the insertion order. Rowkeys are stored as byte arrays and compared byte by byte, so keys written as text sort by their string representation, not by their numeric value. The lowest rowkey appears first in the table. Since the byte for "1" is lower than the byte for "2", <prefix>~10 sorts before <prefix>~2, and so on. See the Rows section on this page to learn more about this.
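The ordering can be reproduced outside HBase. The following is only a sketch in Python, not HBase code: it assumes the keys are UTF-8 encoded text and relies on the fact that Python compares `bytes` objects byte by byte, which mirrors the comparison described above. The literal `<prefix>` stands in for whatever the real prefix is.

```python
# Row keys as raw bytes; "<prefix>" is a placeholder for the real prefix.
keys = [f"<prefix>~{n}".encode("utf-8") for n in range(1, 11)]

# Sorting bytes objects compares them byte by byte (unsigned),
# which is the same kind of ordering a scan returns.
scan_order = sorted(keys)

for key in scan_order:
    print(key.decode("utf-8"))
```

The key ending in ~10 lands right after the one ending in ~1, because the two agree on every byte up to the "1" and the shorter key sorts first.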
When you left-pad the integers with zeros, their natural ordering is kept intact under lexicographic sorting, which is why the scan order then matches the order in which you inserted the data. For example, with a fixed width of two digits, <prefix>~02 sorts before <prefix>~10. To do that, you can design your rowkeys as suggested by @shutty.
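A minimal sketch of the padding idea in Python, for illustration only. The helper name `padded_key` and the width of 10 digits are assumptions made for this example; the width has to be chosen from the largest number you ever expect to store, because a value that needs more digits would break the ordering again.

```python
WIDTH = 10  # assumed maximum number of digits, not something HBase prescribes

def padded_key(prefix: str, n: int, width: int = WIDTH) -> bytes:
    """Build '<prefix>~<n>' with n left-padded with zeros to a fixed width."""
    if n < 0 or len(str(n)) > width:
        raise ValueError(f"{n} does not fit in {width} digits")
    return f"{prefix}~{n:0{width}d}".encode("utf-8")

# Keys created in arbitrary order...
keys = [padded_key("<prefix>", n) for n in (10, 2, 9, 1, 3)]

# ...come back in numeric order when sorted byte by byte.
scan_order = sorted(keys)
for key in scan_order:
    print(key.decode("utf-8"))
```

Note that this only fixes the scan order. The keys are still monotonically increasing, so the hotspotting concern for sequential writes still applies.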
I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
Here are some general guidelines for devising a good design:
- Keep the rowkey as small as possible.
- Avoid using monotonically increasing rowkeys, such as timestamps. This is a poor schema design and leads to RegionServer hotspotting. If you can't avoid them, use a technique such as hashing or salting to avoid hotspotting.
- Avoid using Strings as rowkeys if possible. The String representation of a number takes more bytes than its integer or long representation. For example, a long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String, presuming a byte per character, you would need 20 bytes, about 2.5 times as many.
- Use some mechanism, like hashing, to get a uniform distribution of rows in case your regions are not evenly loaded. You could also create pre-split tables to achieve this.
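The last three guidelines can be combined in one key layout. The sketch below is in Python and is not HBase API code: the function name `salted_key`, the bucket count of 16, and the choice of SHA-256 as the hash are all assumptions made for illustration. It encodes the number as a fixed 8-byte big-endian value instead of a String and puts a one-byte salt in front so that consecutive ids fall into different key ranges.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # assumed; often chosen to match the number of pre-split regions

def salted_key(sequence_id: int, buckets: int = NUM_BUCKETS) -> bytes:
    """Return a 9-byte key: a 1-byte salt followed by an 8-byte big-endian id."""
    if not 0 < buckets <= 256:
        raise ValueError("buckets must fit in one byte")
    # Fixed-width big-endian encoding: 8 bytes, and byte order equals numeric order.
    body = struct.pack(">Q", sequence_id)
    # The salt is derived from the id itself, so a reader can recompute it.
    salt = hashlib.sha256(body).digest()[0] % buckets
    return bytes([salt]) + body

for n in (1, 2, 10):
    print(n, salted_key(n).hex())
```

The trade-off is that keys are now sorted only within a bucket. A point lookup can recompute the salt from the id, but a scan over a range of ids has to be issued once per bucket and the results merged.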
See this link for more on rowkey design.
HTH