Reg: Efficiency among query optimizers in Hive


Problem description

After reading about query optimization techniques, I came across the following:

1. Indexing - bitmap and BTree
2. Partitioning
3. Bucketing

I understand the difference between partitioning and bucketing and when to use each, but I'm still confused about how indexes actually work. Where is the metadata for an index stored? Is it the namenode that stores it? When we create partitions or buckets, we can see multiple directories in HDFS, which explains the query performance optimization, but how can indexes be visualized? Are they really used in real life, given that partitioning and bucketing are available?

Please help me with the above questions. Also, is there a dedicated page for the Hadoop and Hive developer community?

Solution

  1. Indexes in Hive were never really used in real life and were never efficient. As @mazaneicha noted in the comments, the indexing feature was removed completely in Hive 3.0; see Jira HIVE-18448. It was a great try anyway; thanks to Facebook's support, valuable lessons were learned.

There are, however, lightweight indexes in ORC (not classic indexes, but min/max statistics and Bloom filters, which help to prune stripes). ORC indexes are most efficient if the data is sorted during insert (distribute + sort).

  2. Partitioning is most efficient when the partitioning scheme corresponds to how the table is filtered or how it is loaded (partitions can be loaded in parallel, and loading works efficiently when the incremental data is a whole partition).

  3. Bucketing can help optimize joins and GROUP BY, but sort-merge-bucket map join has serious restrictions that make it inefficient as well. Both tables must have the same bucketing scheme, which is rare in real life or can be extremely inefficient. The data should also be sorted when the buckets are loaded.
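As a rough illustration of the partitioning and bucketing points above, here is a minimal HiveQL sketch. The table name `sales`, its columns, the sample date, and the bucket count are hypothetical and chosen only for the example; they do not come from the question.

```sql
-- Hypothetical table, partitioned by the column that most queries filter on.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
-- Optional bucketing: a sort-merge-bucket join needs the other table
-- to be bucketed and sorted on the join key in the same way.
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

-- Each partition is a directory under the table location in HDFS,
-- for example .../sales/sale_date=2020-01-01/
SHOW PARTITIONS sales;

-- Filtering on the partition column lets Hive read only the matching
-- directory instead of scanning the whole table.
SELECT customer_id, SUM(amount) AS total
FROM sales
WHERE sale_date = '2020-01-01'
GROUP BY customer_id;
```

Whether the bucketing clause pays off depends on the joins you actually run; without a second table bucketed the same way, partitioning alone is usually the simpler choice.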

Consider using ORC with its built-in indexes and Bloom filters, and keep the number of files in your table small to avoid metadata overload and to avoid mappers copying thousands of files. Read also "partitions in hive interview questions" and "Sorted Table in Hive".
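A minimal sketch of that ORC advice might look like the following. The table names `events_orc` and `events_staging`, the columns, and the lookup value are assumptions made for illustration; the table properties and the `hive.optimize.index.filter` setting are standard Hive/ORC options, but check the defaults for your Hive version.

```sql
-- Hypothetical ORC table with a Bloom filter on the lookup column.
CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index'         = 'true',    -- min/max statistics per stripe and row group
  'orc.bloom.filter.columns' = 'user_id'  -- Bloom filter for equality predicates
);

-- Sort on load so each stripe covers a narrow user_id range;
-- this is what makes min/max pruning effective.
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_type, event_time
FROM events_staging        -- hypothetical source table
DISTRIBUTE BY user_id
SORT BY user_id;

-- Push the predicate down to the ORC reader so it can skip
-- stripes whose statistics rule out the value.
SET hive.optimize.index.filter=true;
SELECT * FROM events_orc WHERE user_id = 12345;
```

Distributing by `user_id` produces one output file per reducer, so the number of reducers also bounds the file count, which fits the advice to keep the number of files small.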

Useful links.

Official documentation: LanguageManual

Cloudera community: https://community.cloudera.com/
