Hive (Big Data): difference between bucketing and indexing
Question
What is the main difference between bucketing and indexing of a table in Hive?
The main difference is the goal:
- Indexing
The goal of Hive indexing is to speed up lookups on certain columns of a table. Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' loads the entire table or partition and processes every row. If an index exists on col1, only a portion of the file needs to be loaded and processed.
Indexes matter more as tables grow very large, and Hive is typically used on very large tables. Note that built-in indexes were removed in Hive 3.0, so this applies to earlier releases.
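To make the "only a portion of the file" point concrete, here is a minimal Python sketch of the idea. It is not Hive code: the block layout, the helper names (full_scan, build_index, indexed_lookup) and the sample data are all assumptions made up for illustration. It only shows why a lookup structure from column value to block lets a query skip most of the data.

```python
from collections import defaultdict

# Hypothetical data: 1000 rows, with col1 clustered so that each value
# lives in a small region of the "file".
ROWS = [{"id": i, "col1": i // 20} for i in range(1000)]

# Split the rows into fixed-size blocks, standing in for file splits.
BLOCK_SIZE = 100
BLOCKS = [ROWS[i:i + BLOCK_SIZE] for i in range(0, len(ROWS), BLOCK_SIZE)]


def full_scan(blocks, value):
    """No index: read every block and filter row by row."""
    blocks_read = 0
    matches = []
    for block in blocks:
        blocks_read += 1
        matches.extend(row for row in block if row["col1"] == value)
    return matches, blocks_read


def build_index(blocks):
    """Map each col1 value to the ids of the blocks that contain it."""
    index = defaultdict(set)
    for block_id, block in enumerate(blocks):
        for row in block:
            index[row["col1"]].add(block_id)
    return index


def indexed_lookup(blocks, index, value):
    """With an index: read only the blocks known to contain the value."""
    blocks_read = 0
    matches = []
    for block_id in sorted(index.get(value, ())):
        blocks_read += 1
        matches.extend(row for row in blocks[block_id] if row["col1"] == value)
    return matches, blocks_read


if __name__ == "__main__":
    index = build_index(BLOCKS)
    _, scan_reads = full_scan(BLOCKS, 10)
    _, index_reads = indexed_lookup(BLOCKS, index, 10)
    print("blocks read without index:", scan_reads)
    print("blocks read with index:", index_reads)
```

Both functions return the same rows; the difference is how many blocks had to be read to find them. The benefit depends on the matching rows being concentrated in few blocks, which the sample data assumes.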
- Bucketing
Bucketing is usually used to speed up joins: records are distributed into a fixed number of buckets by hashing a specific 'key' or 'id' column. When both tables are bucketed on the join key, records with the same 'key' land in the same bucket, so the join can be done bucket by bucket and runs faster. You can think of it as a technique for decomposing a data set into more manageable parts. The linked article, 5 Tips for efficient Hive queries, covers bucketing as one of its tips.
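The bucket-by-bucket join can be sketched in a few lines of Python. This is an illustration of the idea only, not how Hive implements it: the modulo in bucket_of is a stand-in for Hive's hash function, and the table contents, the user_id column and the helper names are hypothetical.

```python
NUM_BUCKETS = 4


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket (stand-in for a real hash function)."""
    return key % num_buckets


def bucketize(rows, key, num_buckets=NUM_BUCKETS):
    """Distribute rows into num_buckets lists according to their key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join two tables bucketed the same way, one bucket pair at a time."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small lookup for this bucket only, not for the whole table.
        lookup = {}
        for row in right:
            lookup.setdefault(row[key], []).append(row)
        for row in left:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample tables sharing the join column user_id.
USERS = [{"user_id": i, "name": f"user{i}"} for i in range(8)]
ORDERS = [{"user_id": i % 8, "amount": 10 * i} for i in range(12)]

if __name__ == "__main__":
    user_buckets = bucketize(USERS, "user_id")
    order_buckets = bucketize(ORDERS, "user_id")
    result = bucketed_join(user_buckets, order_buckets, "user_id")
    print(len(result), "joined rows")
```

Because both tables use the same bucketing function and bucket count, bucket n of one table only ever needs to be compared with bucket n of the other, so each step works on a small slice of the data.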