Calculating median values in HIVE
Question
I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
These are clearly not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems to ignore one value per group, which would produce a wrong median. Ordering the input does not affect the result. Any ideas?
Recommended answer
Hive has no built-in function that returns an exact median for floating-point values: percentile only accepts integer columns, and percentile_approx returns an approximation. The query below can be used to find the median instead.
set hive.exec.parallel=true;

-- temp1: middle row number per key; temp2: every row with its row number
select temp1.key, temp2.value
from
(
  select key, cast(sum(rank) / count(key) as int) as final_rank
  from
  (
    select key, value,
           row_number() over (partition by key order by value) as rank
    from t1
  ) temp
  group by key
) temp1
inner join
(
  select key, value,
         row_number() over (partition by key order by value) as rank
  from t1
) temp2
on temp1.key = temp2.key
and temp1.final_rank = temp2.rank;
The query above assigns a row_number to each row within its key, ordered by value. The average of those row numbers, cast to int, is the middle row number for the key, and the join picks the value at that position. For a key with an even number of rows the cast truncates, so the query returns the lower of the two middle values rather than their average (28.32192 for key 2 in the sample data, not 28.83). The extra setting hive.exec.parallel=true lets Hive run independent stages in parallel.
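If the average of the two middle values is wanted for keys with an even number of rows, one possible variation is sketched below. It is only a sketch: it assumes the same t1 table and a Hive version that supports windowing functions, the aliases ranked, rn and cnt are made up for illustration, and it has not been run against the data in the question.

```sql
-- Sketch: exact median per key, averaging the two middle rows when the count is even.
-- ranked, rn and cnt are illustrative names, not part of the original answer.
select key, avg(value) as median
from
(
  select key, value,
         row_number() over (partition by key order by value) as rn,
         count(*) over (partition by key) as cnt
  from t1
) ranked
-- odd cnt:  exactly one rn falls in [cnt/2, cnt/2 + 1], the middle row
-- even cnt: both middle rows (cnt/2 and cnt/2 + 1) fall in the range
where rn between cnt / 2.0 and cnt / 2.0 + 1
group by key;
```

With three rows the range is [1.5, 2.5], so only row 2 is kept; with two rows the range is [1, 2], so both rows are kept and avg returns their mean. This needs a single pass over t1 with two window functions instead of a self-join.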