What is the difference between partitioning and bucketing a table in Hive?

Question

I know both are performed on a column in the table, but how does each operation differ?

Recommended answer

Partitioning is often used to distribute load horizontally; it has a performance benefit and helps organize data logically. For example, suppose we have a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For faster query response, the Hive table can be declared PARTITIONED BY (country STRING, DEPT STRING). Partitioning changes how Hive structures the data storage: Hive will now create subdirectories reflecting the partitioning structure, like

.../employees/country=ABC/DEPT=XYZ.
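
As a rough sketch of the partitioned table described above, the HiveQL below declares the two partition columns and then filters on them. The non-partition columns (employee_id, name, salary) and the ORC storage format are assumptions made for illustration; they are not taken from the original answer.

```sql
-- Hypothetical employee table; only the partition columns come from the text.
CREATE TABLE employees (
  employee_id INT,
  name        STRING,
  salary      DOUBLE
)
PARTITIONED BY (country STRING, dept STRING)  -- one subdirectory level per column
STORED AS ORC;                                -- storage format is an assumption

-- List the partitions (subdirectories) that currently exist.
SHOW PARTITIONS employees;

-- Filtering on partition columns lets Hive read only the matching
-- country=ABC/dept=XYZ subdirectory instead of the whole table.
SELECT name, salary
FROM employees
WHERE country = 'ABC' AND dept = 'XYZ';
```

Note that the partition columns are declared only in the PARTITIONED BY clause, not in the main column list; their values are encoded in the directory names rather than stored in the data files.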

If a query restricts results to employees from country=ABC, Hive scans only the contents of the one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is very useful in Hive; however, a design that creates too many partitions may optimize some queries while being detrimental to other important ones. Another drawback of too many partitions is the large number of Hadoop files and directories created unnecessarily, which adds overhead on the NameNode, since it must keep all metadata for the file system in memory.

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses date as the top-level partition and employee_id as the second-level partition ends up with too many small partitions. If we instead bucket the employee table and use employee_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets. Records with the same employee_id are always stored in the same bucket. Assuming the number of employee_id values is much greater than the number of buckets, each bucket holds many employee_id values. When creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in efficient map-side joins.
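
The bucketed layout described above might look like the following HiveQL sketch. The table name employees_bucketed, the non-bucketing columns, the ORC format, and the bucket count of 32 (standing in for XX) are all assumptions chosen for illustration.

```sql
-- Hypothetical bucketed table: rows are assigned to one of 32 buckets
-- by hashing employee_id, so the number of files stays fixed as data grows.
CREATE TABLE employees_bucketed (
  employee_id INT,
  name        STRING,
  salary      DOUBLE
)
CLUSTERED BY (employee_id) INTO 32 BUCKETS  -- 32 is an arbitrary example value
STORED AS ORC;                              -- storage format is an assumption

-- Sample one bucket out of 32 on the bucketing column. Because the table
-- is already clustered on employee_id, this reads roughly 1/32 of the data.
SELECT *
FROM employees_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON employee_id) s;
```

Partitioning and bucketing can also be combined in one table by using both a PARTITIONED BY and a CLUSTERED BY clause, so that each partition directory contains a fixed number of bucket files.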
