Greenplum中的滚动(移动)中值 [英] Rolling (moving) median in Greenplum

查看:163
本文介绍了Greenplum中的滚动(移动)中值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想为Greenplum中的一列计算滚动中位数,即:

I would like to calculate the rolling median for a column in Greenplum, i.e. as below:

|  x | rolling_median_x |
| -- + ---------------- |
|  4 |                4 |
|  1 |              2.5 |
|  3 |                3 |
|  2 |              2.5 |
|  1 |                2 |
|  6 |              2.5 |
|  9 |                3 |

x是整数,并且对于每一行,rolling_median_x显示当前行和先前行的x中位数.例如.第三行rolling_median_x = median(4, 1, 3) = 3.

x is an integer and for each row rolling_median_x shows the median of x for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3.

到目前为止我发现的东西:

Things I've found out so far:

  • median函数不能用作框架窗口函数,即median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • 许多其他功能(例如percent_ranknth_value
  • )也是如此 此版本的Greenplum不支持
  • 递归自联接
  • the median function can't be used as a framed window function, i.e. median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • the same is true for many other function such as percent_rank or nth_value
  • recursive self join is not supported in this version of Greenplum

事实上,我找不到合适的文档,有关这些文档可以在Greenplum中用作框架窗口函数...

As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...

我使用的是Greenplum 4.3.4.0(基于Postgres 8.2.15),不幸的是更新不是一种选择.

I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.

推荐答案

一个评论-来自维基百科的引用:订购

One remark - a citate from Wikipedia: ORDER BY

ORDER BY是对结果集中的行进行排序的唯一方法.没有 此子句,关系数据库系统可以在任何情况下返回行 订单..如果需要订购,则必须在以下位置提供ORDER BY 应用程序发送的SELECT语句.虽然有些数据库 系统允许在子选择中指定ORDER BY子句或 视图定义中,存在不起作用.视图是 逻辑关系表,关系模型要求: 表是一组行,表示没有排序顺序.

ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application. Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.


由于您需要计算当前和先前行的中位数,因此表中必须具有定义行顺序的附加行,并且可以使用确定哪些行在给定行之前,哪些行在给定行之后.
假设这样的id列:


Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
Let say some id column like this:

| id | x | rolling_median_x |
|----|---|------------------|
|  1 | 4 |                4 |
|  2 | 1 |              2.5 |
|  3 | 3 |                3 |
|  4 | 2 |              2.5 |
|  5 | 1 |                2 |
|  6 | 6 |              2.5 |
|  7 | 9 |                3 |


如果无法使用解析函数,请尝试使用纯SQL.
本文显示了各种方法SQL计算中位数的方法.
我认为,亨德森的中位数会最适合我们的需求:


If you cannot use analytic functions, then try pure SQL.
This article shows various methods of computing the Median with SQL.
I think the Henderson’s Median would be best for our needs:

SELECT CASE COUNT(*) % 2
       WHEN 0        -- even sized table
       THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
                                  THEN P2.part_wgt
                                  ELSE NULL END))/2.0
       ELSE P1.part_wgt --odd sized table
       END AS median 
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.part_wgt
HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
                  THEN 1
                  ELSE NULL END)
       = (COUNT(*) + 1) / 2;


只需将每行的查询作为一个依赖子查询运行,一般的想法是这样的:


Just run this query for each row as a dependent subquery, a general idea is like this:

SELECT t.*, (
        SELECT .... Henderson's query FROM table x
        WHERE x.id <= t.id
        ......
       ) As our_median
FROM table t


您可以在此演示中找到示例实现


You can find an example implementation in this demo

SELECT t.*, (
    SELECT CASE COUNT(*) % 2
           WHEN 0        -- even sized table
           THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
                                      THEN P2.x
                                      ELSE NULL END))/2.0
           ELSE P1.x --odd sized table
           END AS median 
      FROM Table333 AS P1, Table333 AS P2
      WHERE p1.id <= t.id AND p2.id <= t.id
     GROUP BY P1.x
    HAVING COUNT(CASE WHEN P1.x >= P2.x
                      THEN 1
                      ELSE NULL END)
           = (COUNT(*) + 1) / 2
    ) as Our_median
FROM Table333 t;

| id | x | rolling_median_x | our_median |
|----|---|------------------|------------|
|  1 | 4 |                4 |          4 |
|  2 | 1 |              2.5 |        2.5 |
|  3 | 3 |                3 |          3 |
|  4 | 2 |              2.5 |        2.5 |
|  5 | 1 |                2 |          2 |
|  6 | 6 |              2.5 |        2.5 |
|  7 | 9 |                3 |          3 |

此查询可能会很慢-这是您必须拥有较旧版本的PostgreSQL的价格

This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL

这篇关于Greenplum中的滚动(移动)中值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆