Calculating median values in HIVE

Question

I have the following table t1:

key  value
 1   38.76
 1   41.19
 1   42.22
 2   29.35182
 2   28.32192
 3   33.66
 3   33.47
 3   33.35
 3   33.47
 3   33.11
 3   32.98
 3   32.5

I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:

1  41.19
2  28.83
3  33.35
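These expected medians can be checked outside Hive. The following is a minimal sketch in Python, assuming the rows of t1 are exactly the ones listed above; the helper name `median_by_key` is made up for illustration. Key 2 has only two values, so its exact median is their mean (about 28.83687, shown truncated as 28.83 above).

```python
from statistics import median

# Sample rows copied from table t1 above, as (key, value) pairs.
rows = [
    (1, 38.76), (1, 41.19), (1, 42.22),
    (2, 29.35182), (2, 28.32192),
    (3, 33.66), (3, 33.47), (3, 33.35), (3, 33.47),
    (3, 33.11), (3, 32.98), (3, 32.5),
]


def median_by_key(rows):
    """Group values by key and return the exact median of each group."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    # statistics.median averages the two middle values for even-sized groups.
    return {key: median(values) for key, values in sorted(groups.items())}


expected_medians = median_by_key(rows)
for key, value in expected_medians.items():
    print(key, value)
```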

However, the percentile_approx function returns these:

1  39.974999999999994
2  28.32192
3  33.230000000000004

These are clearly not the median values.
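As a purely arithmetic observation (not a statement about how percentile_approx works internally), the values reported for keys 1 and 3 coincide with the midpoint between the true median and the value just below it, which fits the impression that one value per group is being left out. For key 2 the reported value is simply the smaller of the two values. The sketch below checks this, taking the key 3 result to be 33.230000000000004; the helper name `midpoint_below_median` is hypothetical.

```python
import math

# Sorted values for the odd-sized groups, copied from table t1.
values_by_key = {
    1: [38.76, 41.19, 42.22],
    3: [32.5, 32.98, 33.11, 33.35, 33.47, 33.47, 33.66],
}

# Values reported by percentile_approx in the question.
reported = {1: 39.974999999999994, 3: 33.230000000000004}


def midpoint_below_median(sorted_values):
    """Average of the median and the value immediately below it (odd-sized input)."""
    mid = len(sorted_values) // 2
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


# Compare with a small tolerance, since the reported values carry float noise.
matches = {
    key: math.isclose(midpoint_below_median(values), reported[key], abs_tol=1e-9)
    for key, values in values_by_key.items()
}
print(matches)
```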

This was the query I ran:

select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key

It seems that one value per group is not being taken into account, which results in a wrong median. Ordering does not affect the result. Any ideas?

Recommended answer

In Hive, the median cannot be calculated directly with the available built-in functions. The query below can be used to find it.

    set hive.exec.parallel=true;
    select temp1.key, temp2.value
    from
      (
      -- middle row number per key: the average of 1..n is (n+1)/2,
      -- and the cast to int truncates it for even-sized groups
      select key, cast(sum(rn)/count(key) as int) as final_rank
      from
        (
        -- alias is rn rather than rank to avoid clashing with the rank() function
        select key, value,
               row_number() over (partition by key order by value) as rn
        from t1
        ) temp
      group by key
      ) temp1
    inner join
      (
      select key, value,
             row_number() over (partition by key order by value) as rn
      from t1
      ) temp2
    on temp1.key = temp2.key
       and temp1.final_rank = temp2.rn;

The query above assigns a row_number to each row within its key, ordered by value. It then takes the middle row number of each key, and the value at that row is the median. I have also set the parameter hive.exec.parallel=true, which allows independent tasks to run in parallel.
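To see what the middle-row-number logic returns on the sample data, here is a small sketch that mirrors it in Python rather than Hive. It assumes the rows of t1 from the question, and the function name `middle_row_value` is made up for illustration; `final_rank` follows the alias used in the query.

```python
from collections import defaultdict

# Sample rows copied from table t1 in the question, as (key, value) pairs.
rows = [
    (1, 38.76), (1, 41.19), (1, 42.22),
    (2, 29.35182), (2, 28.32192),
    (3, 33.66), (3, 33.47), (3, 33.35), (3, 33.47),
    (3, 33.11), (3, 32.98), (3, 32.5),
]


def middle_row_value(rows):
    """Pick, per key, the value whose row number equals int(sum(ranks) / count)."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = {}
    for key, values in sorted(groups.items()):
        ordered = sorted(values)                     # order by value
        ranks = range(1, len(ordered) + 1)           # row numbers start at 1
        final_rank = int(sum(ranks) / len(ordered))  # truncates, like the cast to int
        result[key] = ordered[final_rank - 1]
    return result


picked = middle_row_value(rows)
print(picked)
```

For the odd-sized groups (keys 1 and 3) this gives 41.19 and 33.35, the exact medians. For key 2, which has two rows, the truncation selects row 1, so the result is the lower middle value 28.32192 rather than the average of the two middle values; if the averaged median is needed for even-sized groups, the two middle rows would have to be averaged instead.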
