Calculating median values in HIVE

Views: 1032 Published: 2018/6/12 14:18:08 Tags: statistics hive hiveql median percentile


Problem description

I have the following table t1:

key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5

I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:

1 41.19
2 28.83
3 33.35

However, the percentile_approx function returns these:

1 39.974999999999994
2 28.32192
3 33.230000000000004

These are clearly not the median values.

This was the query I ran:

select key, percentile_approx(value, 0.5, 10000) as median from t1 group by key

It seems not to take one value per group into account, resulting in a wrong median. Ordering does not affect the result. Any ideas?

Recommended answer

In Hive, the median cannot be calculated directly with the available built-in functions. The query below finds the median.

set hive.exec.parallel=true;

select temp1.key, temp2.value
from (
    -- middle row number per key: the mean of ranks 1..n is (n+1)/2, truncated to an int
    select key, cast(sum(rank)/count(key) as int) as final_rank
    from (
        select key, value, row_number() over (partition by key order by value) as rank
        from t1
    ) temp
    group by key
) temp1
inner join (
    -- every row numbered by value within its key
    select key, value, row_number() over (partition by key order by value) as rank
    from t1
) temp2
on temp1.key = temp2.key and temp1.final_rank = temp2.rank;

The query above assigns a row_number to each row within its key by ordering the values for that key. It then takes the middle row_number of each key, which gives the median value. I have also added the parameter "hive.exec.parallel=true;", which lets independent tasks run in parallel.
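Because the query above picks a single row per key, a group with an even number of rows gets the lower of its two middle values (for key 2 that is 28.32192, not the 28.83 the question expects). A possible variant that averages the two middle rows is sketched below. It assumes the same table t1 and the same window-function support the query above relies on; the names ranked, rn and cnt are chosen here for illustration and are not from the original answer.

```sql
-- Sketch only: median per key, averaging the two middle rows when the
-- group has an even number of rows. ranked, rn and cnt are illustrative names.
select key, avg(value) as median
from (
    select key,
           value,
           row_number() over (partition by key order by value) as rn,  -- position within the key
           count(*)     over (partition by key)                as cnt  -- rows in the key
    from t1
) ranked
-- odd cnt:  both expressions give the same middle position
-- even cnt: they give the two positions on either side of the middle
where rn in (floor((cnt + 1) / 2), floor((cnt + 2) / 2))
group by key;
```

For a group of 3 rows both expressions evaluate to 2, so only the middle row is kept. For a group of 2 rows they evaluate to 1 and 2, so both rows are averaged.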
