What is the difference between partitioning and bucketing a table in Hive?


Question

I know both are performed on a column in the table, but how is each operation different?

Solution

Partitioning is often used to distribute load horizontally; it has performance benefits and helps organize data logically. For example, suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For faster query response, the Hive table can be declared PARTITIONED BY (country STRING, DEPT STRING). Partitioning a table changes how Hive structures its data storage: Hive now creates subdirectories that reflect the partitioning structure, such as

.../employees/country=ABC/DEPT=XYZ

If a query restricts employees to country=ABC, Hive scans only the contents of the one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is very useful in Hive; however, a design that creates too many partitions may optimize some queries while being detrimental to other important queries. Another drawback of having too many partitions is the large number of Hadoop files and directories that are created unnecessarily, along with the overhead on the NameNode, which must keep all metadata for the file system in memory.
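To make the directory layout and the pruning effect concrete, here is a small Python sketch. It is a toy model and not Hive code: the sample rows and the helper names (`partition_path`, `write_partitioned`, `scan`) are hypothetical, and an in-memory dict stands in for the directories Hive would create on HDFS.

```python
from collections import defaultdict

# Hypothetical sample rows: (employee_id, name, country, dept).
ROWS = [
    (1, "Ann", "ABC", "XYZ"),
    (2, "Bob", "ABC", "OPS"),
    (3, "Cy", "DEF", "XYZ"),
]

def partition_path(country, dept):
    """Build a Hive-style partition subdirectory (relative path)."""
    return f"employees/country={country}/DEPT={dept}"

def write_partitioned(rows):
    """Group rows by partition directory; the partition values live in the path."""
    dirs = defaultdict(list)
    for employee_id, name, country, dept in rows:
        dirs[partition_path(country, dept)].append((employee_id, name))
    return dict(dirs)

def scan(dirs, country):
    """Read only the directories under the requested country partition."""
    prefix = f"employees/country={country}/"
    touched = sorted(p for p in dirs if p.startswith(prefix))
    rows = [row for p in touched for row in dirs[p]]
    return touched, rows

dirs = write_partitioned(ROWS)
touched, rows = scan(dirs, "ABC")
print(touched)  # only the country=ABC subdirectories
print(rows)     # rows for country=DEF were never read
```

In this model, a filter on country skips every directory outside `country=ABC`, while a filter on a non-partition column such as `name` would still have to read all directories. That is why the scheme only pays off when it matches common filters.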

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses date as the top-level partition and employee_id as the second-level partition ends up with too many small partitions. If we instead bucket the employee table and use employee_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets. Records with the same employee_id are always stored in the same bucket. Assuming the number of distinct employee_id values is much greater than the number of buckets, each bucket holds many employee_id values. When creating the table, you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also helps make map-side joins efficient.
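The bucket assignment described above can be sketched in Python as well. This is only an illustration of the "hash the column value, then take it modulo the bucket count" idea. The function `bucket_for` and the value `NUM_BUCKETS = 4` are assumptions made for the example, and Python's built-in `hash` stands in for Hive's real hash function, which depends on the column type and the Hive version.

```python
from collections import Counter

NUM_BUCKETS = 4  # hypothetical value for the XX in INTO XX BUCKETS

def bucket_for(employee_id, num_buckets=NUM_BUCKETS):
    """Toy bucket assignment: non-negative hash of the key, modulo bucket count.

    Assumption: Python's hash() is a stand-in, not Hive's actual hash.
    """
    return (hash(employee_id) & 0x7FFFFFFF) % num_buckets

# Many distinct ids, few buckets: each bucket ends up holding many ids.
employee_ids = range(1, 1001)
sizes = Counter(bucket_for(e) for e in employee_ids)
print(len(sizes))            # number of buckets in use, never more than NUM_BUCKETS
print(sum(sizes.values()))   # every id landed in exactly one bucket

# The same id always maps to the same bucket number, in any table that is
# bucketed the same way, so a join can pair bucket i with bucket i.
employees_table = {e: bucket_for(e) for e in (7, 42, 99)}
salaries_table = {e: bucket_for(e) for e in (99, 7, 42)}
print(employees_table == salaries_table)
```

Unlike the partitioned layout, the number of buckets here stays at `NUM_BUCKETS` no matter how many ids are added, which mirrors the "fixed number of buckets" advantage mentioned in the answer.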
