Hive: number of reducers in group by and count(distinct)


Question

I was told that count(distinct) may cause data skew because only one reducer is used.

I ran a test with two queries on a table with 5 billion rows:

Query A:

select count(distinct columnA) from tableA

Query B:

select count(columnA) from
(select columnA from tableA group by columnA) a

Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds. The result seems to be as expected.

However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the same cumulative CPU seconds. This suggests there is no genuine difference between them, and the time difference may be caused by cluster load.

I am confused about why both queries use a single reducer. I even tried setting mapreduce.job.reduces, but it had no effect. Also, if both use 1 reducer, why do people advise against count(distinct)? It seems data skew cannot be avoided either way.

Recommended answer

Both queries use the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count result. Multiple reducers on the same vertex run independently and in isolation, and each produces its own output; this is why the last stage has a single reducer. The difference is in the plan.

In the first query, the single reducer reads the output of every mapper and performs the distinct count over all the data, so it processes too much data.

The second query uses intermediate aggregation, so the final reducer receives partially aggregated data (the distinct values produced by the previous step). The final reducer still has to aggregate these partial results to get the final result, but this can be much less data than in the first case.
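As a reading aid, query B can be annotated with the part of the explanation each piece corresponds to. The comments restate the reasoning above and are not output from Hive; the actual stages depend on the Hive version and settings.

```sql
-- Inner query: group by columnA collapses duplicates, so each distinct
-- value is passed on only once (the intermediate aggregation).
-- Outer query: the final single reducer only counts those rows.
-- count(columnA) skips a NULL group, as count(distinct columnA) skips NULLs.
select count(columnA) from
(select columnA from tableA group by columnA) a;
```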

As of Hive 1.2.0 there is an optimization for count(distinct), so you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true
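A minimal sketch of how this setting could be applied in a session before running query A unchanged. Whether the rewrite actually takes effect depends on the Hive version and execution engine, so treat it as something to try and verify rather than a guaranteed fix.

```sql
-- Optimization named in the answer (Hive 1.2.0 and later).
SET hive.optimize.distinct.rewrite=true;

-- Query A from the question, unchanged.
select count(distinct columnA) from tableA;
```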

There is also map-side aggregation: mappers can pre-aggregate data and produce the distinct values within their own portion of the data (their splits). Set this property to allow map-side aggregation: hive.map.aggr=true
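A short sketch of enabling map-side aggregation for the session, assuming it is not already on in your configuration. Running SET with only the property name prints its current value, which is a quick way to check before changing anything.

```sql
-- Print the current value of the property.
SET hive.map.aggr;

-- Allow each mapper to pre-aggregate its own split before the shuffle.
SET hive.map.aggr=true;

select count(distinct columnA) from tableA;
```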

Use the EXPLAIN command to check the difference between the execution plans.
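For the two queries from the question, that comparison could look like the following. The plan output is not reproduced here because it varies with the Hive version, engine, and settings; the point is to compare how many stages each plan has and where the group-by operators appear.

```sql
-- Plan for query A.
EXPLAIN
select count(distinct columnA) from tableA;

-- Plan for query B.
EXPLAIN
select count(columnA) from
(select columnA from tableA group by columnA) a;
```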

See also this answer: https://stackoverflow.com/a/51492032/2700344
