How does HBase partition a table across regionservers?


Question



Please tell me how HBase partitions a table across regionservers.

For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ..., the tenth 9M - 10M?

I would like my row key to be a timestamp, but since most queries would apply to the latest dates, would all queries be processed by only one regionserver? Is that true?

Or maybe this data would be spread differently?
Or maybe I could somehow create more regions than I have region servers, so that (following the example above) server 1 would hold keys 0 - 0.5M and 3M - 3.5M? That way my data would be spread more evenly. Is this possible?


Update

I just found that there is an option, hbase.hregion.max.filesize. Do you think this will solve my problem?

Solution

Regarding partitioning, you can read Lars' blog post on HBase's architecture, or Google's Bigtable paper, which HBase "clones".

If your row key is only a timestamp, then yes, the region with the biggest keys will always be hit by new requests (since a region is served by only a single region server).

Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
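A minimal sketch of the salting idea in Python (just key design, not the HBase client; the bucket count and key layout are assumptions, not Mozilla's actual scheme):

```python
import hashlib

NUM_BUCKETS = 16  # assumption: roughly the number of regions to spread writes over

def salted_key(timestamp_ms: int) -> bytes:
    """Prefix the timestamp with a deterministic hash bucket ('salt')."""
    raw = str(timestamp_ms).encode()
    bucket = int(hashlib.md5(raw).hexdigest(), 16) % NUM_BUCKETS
    return b"%02d-%s" % (bucket, raw)

def scan_start_keys(start_ms: int) -> list:
    """The cost of salting: a time-range scan must fan out, one scan per bucket."""
    return [b"%02d-%d" % (b, start_ms) for b in range(NUM_BUCKETS)]
```

Because the salt is derived from the key itself, reads can recompute it, but range scans over time now have to merge results from every bucket.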

Can you prefix the timestamp with any ID? For example, if you only request data for specific users, then prefix the ts with that user ID and it will give you a much better load distribution.
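A hedged sketch of such a composite key, assuming numeric user IDs and millisecond timestamps (the reversed timestamp is a common convention to make a user's newest rows sort first; it is not prescribed by the answer above):

```python
import struct

MAX_LONG = 2**63 - 1

def user_ts_key(user_id: int, timestamp_ms: int) -> bytes:
    # Fixed-width big-endian encoding keeps byte order equal to numeric order,
    # and the reversed timestamp makes the newest row sort first per user.
    return struct.pack(">qq", user_id, MAX_LONG - timestamp_ms)
```

With this layout, writes are spread across as many key ranges as there are active users, while a scan prefixed by one user ID still reads that user's rows newest-first.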

If not, then use UUIDs or anything else that will randomly distribute your keys.

About hbase.hregion.max.filesize

Setting the max file size on that table (which you can do with the shell) doesn't mean that each region will be exactly X MB big (where X is the value you set). So let's say your row keys are all timestamps, which means each new row key is bigger than the previous one. This means a new row will always be inserted into the region with the empty end key (the last one). At some point, one of that region's files will grow bigger than the threshold (through compactions), and the region will be split around the middle. The lower keys will end up in their own region, the higher keys in another. But since your new row keys are always bigger than the previous ones, you will only ever write to that newer region (and so on).
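The "always writing to the last region" behavior can be illustrated with a toy model (plain Python, not the HBase client; the split point is hypothetical):

```python
import bisect

# Regions modeled as a sorted list of start keys; a row lands in the region
# whose range contains its key, and the last region has an open end key.
region_starts = [b"", b"5000000"]  # e.g. after one split around the middle

def region_for(row_key: bytes) -> int:
    return bisect.bisect_right(region_starts, row_key) - 1

# Monotonically increasing, timestamp-like keys all land in the last region:
hot_regions = {region_for(str(ts).encode()) for ts in range(6000000, 6000100)}
```

However many times the last region splits, every new monotonically increasing key still falls past the newest split point, so `hot_regions` stays a single region.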

tl;dr even though you have more than 1,000 regions, with this schema the region with the biggest row keys will always get the writes, which means that the hosting region server will become a bottleneck.
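For reference, both knobs discussed in this thread can be exercised from the HBase shell; the table names 't1'/'t2' and column family 'cf' below are placeholders:

```
# HBase shell -- set the per-table split threshold (in bytes)
alter 't1', MAX_FILESIZE => '1073741824'

# Pre-split a new table into five regions at creation time
create 't2', 'cf', SPLITS => ['2000000', '4000000', '6000000', '8000000']
```

Pre-splitting gives you more regions than region servers up front, but with purely sequential keys it only helps if the key design (salting or prefixing, as above) actually routes writes into those regions.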
