Hive: number of reducers in group by and count(distinct)


Question

I was told that count(distinct) may cause data skew because only one reducer is used.

I ran a test on a table with 5 billion rows, using two queries:

Query A:

select count(distinct columnA) from tableA

Query B:

select count(columnA) from
(select columnA from tableA group by columnA) a

In practice, query A takes about 1000-1500 seconds while query B takes 500-900 seconds, which seems to match what I expected.

However, I then noticed that both queries use 370 mappers and 1 reducer, and that they have almost the same cumulative CPU seconds. This suggests there is no genuine difference between them, and that the time difference may be caused by cluster load.

I am confused about why both queries use a single reducer. I even tried setting mapreduce.job.reduces, but it had no effect. Also, if both use 1 reducer, why do people advise against count(distinct)? It seems data skew cannot be avoided either way.

Recommended answer

Both queries use the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count result. Multiple reducers on the same vertex run independently and in isolation, and each produces its own output, which is why the last stage has a single reducer. The difference is in the plan.

In the first query, a single reducer reads the output of every mapper and performs the distinct count calculation on all of the data, so it processes too much data.

The second query uses intermediate aggregation: the final reducer receives partially aggregated data (the distinct values produced by the previous step). The final reducer still has to aggregate the partial results to get the final result, but it may receive much less data than in the first case.

As of Hive 1.2.0 there is an optimization for count(distinct), so you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true

There is also map-side aggregation: mappers can pre-aggregate data and produce the distinct values within their own portion of the data (their splits). Set this property to allow map-side aggregation: hive.map.aggr=true
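
As a minimal sketch, the two properties mentioned above could be set at the session level before running query A from the question. The table and column names are the ones used in the question. Whether these settings are already enabled by default depends on the Hive version and cluster configuration, so treat this as an illustration rather than a required step.

```sql
-- Let Hive rewrite count(distinct) into a group-by based plan (Hive 1.2.0 and later).
SET hive.optimize.distinct.rewrite=true;
-- Allow mappers to pre-aggregate within their own splits.
SET hive.map.aggr=true;

-- Query A from the question, unchanged.
select count(distinct columnA) from tableA;
```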

Use the EXPLAIN command to check the difference in the execution plans.
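
For example, prefixing each query from the question with EXPLAIN prints its plan without running it. This is only a sketch of how the comparison might be set up; the exact plan output depends on the Hive version, the execution engine, and the settings in effect.

```sql
-- Plan for query A: count(distinct) directly on the table.
EXPLAIN select count(distinct columnA) from tableA;

-- Plan for query B: group by first, then count the grouped rows.
EXPLAIN select count(columnA) from
(select columnA from tableA group by columnA) a;
```

When comparing the two outputs, it may help to look at how many stages each plan has and at which stage the group-by aggregation happens.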

See also this answer: https://stackoverflow.com/a/51492032/2700344
