为什么选择不同的分区列很慢? [英] Why select distinct partitioned column is very slow?

查看:50
本文介绍了为什么选择不同的分区列很慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个表zhihu_answer_increment,它被列ym分区.当我执行查询 select distinct(ym) from zhihu.zhihu_answer_increment; 时,花了 1 多分钟才完成.在此过程中,hive 启动了 map-reduce 作业.这是日志:

I hava a table zhihu_answer_increment, it was partitioned by column ym. When I execute query select distinct(ym) from zhihu.zhihu_answer_increment;, it took over 1 min to finish. During the process, hive launched a map-reduce job. here is the log:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.59 sec   HDFS Read: 14969 HDFS Write: 106 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 590 msec
INFO  : Completed executing command(queryId=hive_20191015113300_a6f58bad-f35b-4243-890a-a0d9ba9a5210); Time taken: 95.048 seconds
INFO  : OK

相比之下,show partitions zhihu_answer_increment;返回结果要快得多(只需几秒钟).但是我需要将 select distinct(ym) from zhihu.zhihu_answer_increment 作为子查询.

In comparison, show partitions zhihu_answer_increment; return the result much faster (just few seconds). But I need to take select distinct(ym) from zhihu.zhihu_answer_increment as a sub-query.

那么我该如何优化这个查询呢?而且我不明白为什么它启动了一个 mapreduce 作业,就我而言,仅检查分区目录就足以进行此查询.或者我的考虑太简单.

So how can I optimise this query? and I don't understand why it launched a mapreduce job, as far as I'm concerned, only checking the partition directory is enough for this query. Or maybe my consideration is too simple.

推荐答案

如果您可以使用 shell,则使用 SHOW PARTITIONS 将分区列表转换为变量(工作速度快)并使用 AWK 将其转换为逗号分隔的列表.然后使用带有 partition_list 的变量参数化您的 hive 脚本:

If you can use shell, then get partition list into variable using SHOW PARTITIONS (works fast) and transform it to the comma-separated list using AWK. Then parametrize your hive script using the variable with partition_list:

像这样:

partition_list=$(hive -S -e "show partitions your_table;"  |  awk -vq="'" -F "=" 'f&&!NF{exit}{f=1}f{printf c q $2 q}{c=","}')


hive -e "select 1 from your_table where partition_column in (${partition_list}) limit 1"

这篇关于为什么选择不同的分区列很慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆