HBase: Create multiple tables or a single table with many columns?

Question

When does it make sense to create multiple tables as opposed to a single table with a large number of columns? I understand that tables typically have only a few column families (1-2) and that each column family can support 1000+ columns.

When does it make sense to create separate tables, given that HBase seems to perform well with a potentially large number of columns within a single table?

Recommended answer

Before answering the question itself, let me first state some of the major factors that come into play. I am going to assume that the file system in use is HDFS.


  1. A table is divided into non-overlapping partitions of the keyspace called regions.

The key-range -> region mapping is stored in a special single-region table called meta.

The data in one HBase column family for a region is stored in a single HDFS directory. It is usually several files, but for all intents and purposes we can assume that a region's data for a column family is stored in a single file on HDFS called a StoreFile / HFile.

A StoreFile is essentially a sorted file containing KeyValues. A KeyValue logically represents the following, in order: (RowLength, RowKey, FamilyLength, FamilyName, Qualifier, Timestamp, Type). For example, if your region holds only two KVs for a CF, with the same row key but values in two different columns, this is what the StoreFile will look like (except that it is actually byte encoded, and metadata such as the lengths mentioned above is stored as well):

Key1:Family1:Qualifier1:Timestamp1:Value1:Put

Key1:Family1:Qualifier2:Timestamp2:Value2:Put



  • The StoreFile is divided into blocks (default 64KB), and the key range contained in each data block is indexed by multi-level indexes. A random lookup inside a single block can be done using index + binary search. However, a scan has to go serially through a block after locating its starting position in the first block it needs.

    HBase is an LSM-tree based database, which means that it has an in-memory log (called the Memstore) that is periodically flushed to the filesystem, creating the StoreFiles. The Memstore is shared by all columns inside a single region for a particular column family.
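The storage model described above can be illustrated with a small toy in Python. This is only a sketch under stated assumptions, not HBase's implementation: `KeyValue`, `ToyStore` and `flush_threshold` are hypothetical names, the flush is triggered by a cell count rather than by the memory size HBase actually uses, and a "store file" is just an immutable sorted tuple kept in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyValue:
    """Hypothetical, simplified cell: no length fields, no byte encoding."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    type: str
    value: bytes

    def sort_key(self):
        # Row, family and qualifier ascending; newest timestamp first.
        return (self.row, self.family, self.qualifier, -self.timestamp)


class ToyStore:
    """One column family of one region: a memstore plus flushed store files."""

    def __init__(self, flush_threshold=3):
        # Assumption: flush after N cells (a stand-in for a size-based limit).
        self.flush_threshold = flush_threshold
        self.memstore = []      # unsorted, in-memory writes
        self.store_files = []   # each entry is an immutable, sorted tuple

    def put(self, row, family, qualifier, timestamp, value):
        self.memstore.append(
            KeyValue(row, family, qualifier, timestamp, "Put", value))
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as one sorted "file" and start a fresh one.
        if self.memstore:
            self.store_files.append(
                tuple(sorted(self.memstore, key=KeyValue.sort_key)))
            self.memstore = []


store = ToyStore(flush_threshold=2)
store.put(b"Key1", b"Family1", b"Qualifier2", 2, b"Value2")
store.put(b"Key1", b"Family1", b"Qualifier1", 1, b"Value1")  # triggers a flush
```

After the second put, `store.store_files[0]` holds both cells of `Key1` next to each other in qualifier order, which is the layout the two-line StoreFile example above describes.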

    There are several optimizations involved in reading data from and writing data to HBase, but the information given above holds true conceptually. Given the above statements, the following are the pros of each approach (a single table with many columns vs. several tables) over the other:

    Single table with multiple columns


    1. Better on-disk compression due to prefix encoding, since all data for a key is stored together rather than in multiple files across tables. This also results in reduced disk activity due to the smaller data size.
    2. Less load on the meta table, because the total number of regions is going to be smaller. You'll have N regions for just one table rather than N*M regions for M tables. This means faster region lookups and lower contention on the meta table, which is a concern for large clusters.
    3. Faster reads and low IO amplification (causing less disk activity) when you need to read several columns for a single row key.
    4. You get the advantage of row-level transactions, batching and other performance optimizations when writing to multiple columns for a single row key.
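To make the point about meta-table load concrete, here is a rough Python sketch of a key-range -> region lookup. It is a simplification under stated assumptions: `build_meta` and `locate_region` are hypothetical helpers, and the `(table, start_key)` tuple is a stand-in for the real layout of meta rows, which this sketch does not try to reproduce.

```python
import bisect


def build_meta(tables):
    """tables: {table_name: list of region start keys, the first being b""}.

    Returns a sorted list with one (table, start_key) entry per region.
    """
    return sorted((name, key) for name, keys in tables.items() for key in keys)


def locate_region(meta, table, row_key):
    # The owning region is the last entry of `table` whose start key <= row_key.
    i = bisect.bisect_right(meta, (table, row_key)) - 1
    if i < 0 or meta[i][0] != table:
        raise KeyError(table)
    return meta[i]


# One wide table with N = 3 regions ...
one_table = build_meta({"wide": [b"", b"g", b"p"]})
# ... versus M = 4 narrow tables that share the same row-key split points.
many_tables = build_meta({f"narrow{i}": [b"", b"g", b"p"] for i in range(4)})

region = locate_region(one_table, "wide", b"m")
```

Here `one_table` has 3 entries while `many_tables` has 12 (N*M), so every additional table that shares the row-key space multiplies the number of entries that have to be stored, cached and looked up.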

    When to use this


    1. If you want to perform row-level transactions across multiple columns, you have to put them in a single table.
    2. Even when you don't need row-level transactions, but you often write to or query from multiple columns for the same row key. A good rule of thumb is that if, on average, more than 20% of your columns have values for a single row, you should try to put them together in a single table.
    3. When you have too many columns.
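The 20% rule of thumb above can be checked against a sample of your data. The following Python sketch is one possible way to do that; `average_fill_ratio` and `suggest_layout` are hypothetical helper names, and rows are modelled as plain dicts rather than anything read from HBase.

```python
def average_fill_ratio(rows, all_columns):
    """Mean fraction of `all_columns` that actually hold a value per row.

    rows: iterable of dicts mapping column name -> value.
    """
    rows = list(rows)
    columns = set(all_columns)
    if not rows or not columns:
        return 0.0
    populated = sum(len(columns & set(row)) for row in rows)
    return populated / (len(rows) * len(columns))


def suggest_layout(rows, all_columns, threshold=0.20):
    # threshold mirrors the 20% rule of thumb; treat it as a starting point.
    if average_fill_ratio(rows, all_columns) > threshold:
        return "single table"
    return "consider separate tables or column families"


columns = ["a", "b", "c", "d", "e"]
sample = [{"a": 1, "b": 2}, {"a": 1}]
layout = suggest_layout(sample, columns)
```

In this sample, 3 of the 10 possible cells are populated, so the ratio is above the threshold and the helper suggests keeping the columns in one table.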

    Multiple tables


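    1. Faster scans for each table and lower IO amplification if the scans are mostly concerned with only one column (remember that the sequential look-ups in a scan will unnecessarily read columns it doesn't need).
    2. Good logical separation of data, especially when you don't need to share row keys across columns. Have one table for one type of row key.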

    When to use


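    1. When there is a clear logical separation of data. For example, if your row key schema differs across different sets of columns, put those sets of columns in separate tables.
    2. When only a small percentage of columns have values for a row key (see
    3. When you want different storage configs for different sets of columns, e.g. TTL, compaction rate, blocking file counts, memstore size etc. (for this use case, see below for a better approach).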

    Sort of both: multiple CFs in a single table

    As you can see from the above, both approaches have their pros. The choice becomes really difficult when you have the same row key structure for several columns (so you want to share the row key for storage efficiency, or you need transactions across columns) but the data is very sparse (meaning you write/read only a small percentage of columns for a row key).
    It seems like you need the best of both worlds in this case. That's where column families come in. If you can partition your column set into logical subsets where you mostly access/read/write only a single subset, or you need storage-level configs per subset (like TTL, storage class, a write-heavy compaction schedule etc.), then you can make each subset a column family.
    Since data for a particular column family is stored in a single file (set of files), you get better locality while reading a subset of columns without slowing down the scans.
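The locality argument for column families can be sketched with a toy model in Python. Everything here is an assumption made for illustration: `ToyRegion`, `load` and the family names are hypothetical, each family's "store" is a plain dict, and the scan deliberately touches every cell of the family it reads, which ignores the seek and skip optimizations a real scanner may apply.

```python
class ToyRegion:
    """One region; every column family gets its own separate store."""

    def __init__(self, families):
        # family -> {(row, qualifier): value}
        self.stores = {family: {} for family in families}

    def put(self, row, cells):
        """cells: {(family, qualifier): value}. One put may span families."""
        for (family, qualifier), value in cells.items():
            self.stores[family][(row, qualifier)] = value

    def scan(self, family, qualifier):
        """Return (matching cells, number of cells touched) for one family."""
        touched = 0
        hits = []
        for (row, q), value in sorted(self.stores[family].items()):
            touched += 1
            if q == qualifier:
                hits.append((row, value))
        return hits, touched


def load(region, profile_family, metrics_family, rows=3, metrics=10):
    # Each row has one profile column and several metric columns.
    for r in range(rows):
        cells = {(profile_family, "name"): f"user{r}"}
        for m in range(metrics):
            cells[(metrics_family, f"m{m}")] = m
        region.put(f"row{r}", cells)


single_cf = ToyRegion(["d"])
load(single_cf, "d", "d")            # everything in one family
split_cf = ToyRegion(["p", "m"])
load(split_cf, "p", "m")             # profile and metrics separated

_, touched_single = single_cf.scan("d", "name")
_, touched_split = split_cf.scan("p", "name")
```

With three rows, the single-family scan walks all 33 cells to find the three `name` values, while the split layout walks only the 3 cells in family `p`. The row key is still shared, and a single `put` can still write to both families of the same row.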

    However, there is a catch

    Do not use column families unnecessarily. There is a cost associated with them, and HBase does not do well with 10+ CFs due to how region-level write locks, monitoring etc. work in HBase. Use CFs only if there is a logical relationship between the columns across CFs but you generally don't perform operations across CFs, or if you need different storage configs for different CFs.
    It's perfectly fine to use only a single CF containing all your columns if they share a row key schema, unless you have a very sparse data set, in which case you might need different CFs or different tables based on the points mentioned above.
