How can I pre-split in HBase


Problem description


I am storing data in HBase with 5 region servers, and I am using the MD5 hash of the URL as my row key. Currently all the data is getting stored in one region server only, so I want to pre-split the regions so that data goes uniformly to each region server. I want to split the data on the first character of the row key, since the first character ranges from 0 to f (16 characters). For example, data with row keys starting with 0 up to 3 would go to the 1st region server, 3 to 6 to the 2nd, 6 to 9 to the 3rd, a to d to the 4th, and d to f to the 5th. How can I do it?

Recommended answer


You can provide a SPLITS property when creating the table.

create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}


The 4 split points will generate 5 regions.
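The same pre-split can also be done programmatically. Below is a minimal sketch using the Java client API (the HBase 2.x builder classes; older versions expose an equivalent Admin.createTable(HTableDescriptor, byte[][]) overload). The table and column family names simply mirror the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // The same 4 split points as the shell example, giving regions
            // [,3) [3,6) [6,9) [9,d) [d,) over the hex key space.
            byte[][] splitKeys = {
                Bytes.toBytes("3"), Bytes.toBytes("6"),
                Bytes.toBytes("9"), Bytes.toBytes("d")
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("tableName"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                    .build(),
                splitKeys);
        }
    }
}

Since the row keys here are MD5 hex strings, note that HBase also ships a HexStringSplit algorithm (part of RegionSplitter) that computes evenly spaced hex split points for you, e.g. from the shell:

create 'tableName', 'cf1', {NUMREGIONS => 5, SPLITALGO => 'HexStringSplit'}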


Please note that HBase's DefaultLoadBalancer doesn't guarantee a 100% even distribution between region servers; it can happen that a region server hosts multiple regions from the same table.


For more information about how it works, take a look at this:

public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)


Generate a global load balancing plan according to the specified map of server information to the most loaded regions of each server. The load balancing invariant is that all servers are within 1 region of the average number of regions per server. If the average is an integer number, all servers will be balanced to the average. Otherwise, all servers will have either floor(average) or ceiling(average) regions.

HBASE-3609 modeled regionsToMove using Guava's MinMaxPriorityQueue so that we can fetch from both ends of the queue. At the beginning, we check whether there is an empty region server just discovered by the Master. If so, we alternately choose new / old regions from the head / tail of regionsToMove, respectively. This alternation avoids clustering young regions on the newly discovered region server. Otherwise, we choose new regions from the head of regionsToMove.

Another improvement from HBASE-3609 is that we assign regions from regionsToMove to underloaded servers in round-robin fashion. Previously, one underloaded server would be filled before we moved on to the next underloaded server, leading to clustering of young regions. Finally, we randomly shuffle underloaded servers so that they receive offloaded regions relatively evenly across calls to balanceCluster(). The algorithm is currently implemented as such:

  1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
  2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.
  3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN. If so we end up with a number of regions required to do so, neededRegions. It is also possible that we were able to fill each underloaded but ended up with regions that were unassigned from overloaded servers but that still do not have assignment. If neither of these conditions hold (no regions needed to fill the underloaded servers, no regions leftover from overloaded servers), we are done and return. Otherwise we handle these cases below.
  4. If neededRegions is non-zero (still have underloaded servers), we iterate the most loaded servers again, shedding a single server from each (this brings them from having MAX regions to having MIN regions).
  5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers filling each to MIN. If we still have more regions that need assignment, again iterate the least loaded servers, this time giving each one (filling them to MAX) until we run out.
  6. All servers will now either host MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal number of regions possible are moved.
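
As a rough illustration of the MIN/MAX invariant described above, here is a minimal, self-contained sketch. It is not the real DefaultLoadBalancer: the serverName-to-regionCount map and all names are hypothetical, and it only computes the target load per server, while the real balancer also decides which concrete regions move, using the ordering and round-robin rules from HBASE-3609.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MinMaxBalanceSketch {

    // Compute per-server target region counts so that every server ends up
    // with either MIN = floor(average) or MAX = ceiling(average) regions,
    // and the most loaded servers are the ones left at MAX, so the fewest
    // possible regions move.
    static Map<String, Integer> targetLoads(Map<String, Integer> currentLoad) {
        int total = currentLoad.values().stream().mapToInt(Integer::intValue).sum();
        int servers = currentLoad.size();
        int min = total / servers;                        // MIN = floor(average)
        int max = (total % servers == 0) ? min : min + 1; // MAX = ceiling(average)
        int serversAtMax = total - min * servers;         // how many must end at MAX

        // Sort servers by current load, most loaded first.
        List<Map.Entry<String, Integer>> byLoad = new ArrayList<>(currentLoad.entrySet());
        byLoad.sort(Comparator.comparingInt(
                (Map.Entry<String, Integer> e) -> e.getValue()).reversed());

        Map<String, Integer> target = new LinkedHashMap<>();
        for (int i = 0; i < byLoad.size(); i++) {
            target.put(byLoad.get(i).getKey(), i < serversAtMax ? max : min);
        }
        return target;
    }

    public static void main(String[] args) {
        // 13 regions across 5 servers: average 2.6, so MIN = 2 and MAX = 3,
        // and 3 servers must end at MAX while the other 2 end at MIN.
        Map<String, Integer> load = new LinkedHashMap<>();
        load.put("rs1", 6);
        load.put("rs2", 4);
        load.put("rs3", 1);
        load.put("rs4", 1);
        load.put("rs5", 1);
        System.out.println(targetLoads(load)); // {rs1=3, rs2=3, rs3=3, rs4=2, rs5=2}
    }
}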


TODO: We can at-most reassign the number of regions away from a particular server to be how many they report as most loaded. Should we just keep all assignment in memory? Any objections? Does this mean we need HeapSize on HMaster? Or just careful monitor? (current thinking is we will hold all assignments in memory)

