Hive (Big Data): difference between bucketing and indexing


Question

What is the main difference between bucketing and indexing of a table in Hive?

Answer

The main difference is the goal:

  • Indexing

The goal of Hive indexing is to speed up lookups on specific columns of a table. Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' loads the entire table or partition and processes every row. If an index exists on col1, only a portion of the data files needs to be loaded and processed.

Indexes matter more as tables grow very large, which is exactly the scale Hive is designed for.
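
The difference between a full scan and an indexed lookup can be sketched in a few lines of Python. This is only a conceptual model, not how Hive stores or reads its indexes: the table contents and the names `rows`, `build_index`, and `indexed_lookup` are made up for illustration, and an in-memory dictionary stands in for the index.

```python
from collections import defaultdict

# Toy table: each row is (col1, payload). The data is illustrative only.
rows = [(10, "a"), (7, "b"), (10, "c"), (3, "d")]


def full_scan(table, value):
    """Without an index: examine every row and keep the matches."""
    return [row for row in table if row[0] == value]


def build_index(table):
    """Map each col1 value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(table):
        index[row[0]].append(position)
    return index


def indexed_lookup(table, index, value):
    """With an index: read only the rows the index points at."""
    return [table[position] for position in index.get(value, [])]


col1_index = build_index(rows)
matches = indexed_lookup(rows, col1_index, 10)
```

Both functions return the same rows for `col1 = 10`; the indexed version reaches them without visiting the rows that do not match, which is the saving the answer describes.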

  • Bucketing

Bucketing is typically used to optimize joins: records are distributed into buckets by a specific key or id, so records with the same key always land in the same bucket. When both tables are bucketed on the join key, the join can match bucket against bucket instead of comparing the full tables, which makes it faster. More generally, bucketing is a technique for decomposing a data set into more manageable parts. The article "5 Tips for efficient Hive queries", linked in the original answer, covers bucketing as one of its tips.
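
As a rough illustration of why bucketing helps joins, the Python sketch below assigns records to buckets by key and then joins the two data sets one bucket pair at a time. It is a simplified model under stated assumptions: a plain modulo on integer keys stands in for Hive's hash function, and `NUM_BUCKETS`, `bucketize`, `bucketed_join`, and the sample data are hypothetical names chosen for this example.

```python
NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for the example


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket (modulo stands in for a hash)."""
    return key % num_buckets


def bucketize(records, num_buckets=NUM_BUCKETS):
    """Split (key, value) records into num_buckets lists by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets=NUM_BUCKETS):
    """Join two data sets by comparing only buckets with the same number."""
    joined = []
    bucket_pairs = zip(bucketize(left, num_buckets), bucketize(right, num_buckets))
    for left_bucket, right_bucket in bucket_pairs:
        # Build a small lookup for this bucket only, not for the whole table.
        lookup = {}
        for key, value in right_bucket:
            lookup.setdefault(key, []).append(value)
        for key, left_value in left_bucket:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


users = [(1, "alice"), (2, "bob"), (5, "carol")]
orders = [(1, "book"), (5, "pen"), (1, "lamp"), (9, "ink")]
result = bucketed_join(users, orders)
```

Because equal keys always map to the same bucket number, no match can be missed by skipping the other buckets, and each step only has to hold one bucket's worth of data.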
