窗口函数:在第一列之后按另一列划分 [英] Window functions: PARTITION BY one column after ORDER BY another

查看:108
本文介绍了窗口函数:在第一列之后按另一列划分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

免责声明:所示问题比我最初预期的要普遍得多。下面的示例来自一个解决方案,另一个问题。但是,现在我拿这个样本来解决更多问题-主要与时间序列有关(请看一下右侧栏中的链接部分)。

Disclaimer: The shown problem is much more general than I expected first. The example below is taken from a solution to another question. But now I was taking this sample for solving many problems more - mostly related to time series (have a look at the "Linked" section in the right bar).

所以我先尝试更笼统地解释这个问题:

So I am trying to explain the problem more generally first:

我正在使用PostgreSQL,但是我确信这个问题在支持DBMS的其他窗口功能(MS SQL Server,Oracle等)中也存在。

窗口函数可用于将某些值组合在一起通过共同的属性或值。例如,您可以按日期对行进行分组。然后,您可以计算每个日期内的最大值,平均值或计数行数之类的值。

Window functions can be used to group certain values together by a common attribute or value. For example you can group rows by a date. Then you are able to calculate the max value within every single date or an average value or counting rows or whatever.

这可以通过定义 PARTITION 。按日期分组将与 PARTITION BY date_column 一起使用。现在,您想执行一个需要在组内进行特殊排序的操作(计算行号或汇总一列)。可以使用 PARTITON BY date_column ORDER BY an_attribute_column 来完成。

This can be achieved by defining a PARTITION. Grouping by dates would work with PARTITION BY date_column. Now you want to do an operation which needs a special order within your groups (calculating row numbers or sum up a column). This can be done with PARTITON BY date_column ORDER BY an_attribute_column.

现在考虑更好地解决时间序列问题。如果没有日期但是有时间戳怎么办。然后,您将无法再按时间列进行分组。但是,尽管如此,按添加顺序分析数据可能很重要(也许时间戳就是数据集的创建时间)。然后,您意识到一些连续的行具有相同的值,并且您希望通过该公用值对数据进行分组。但是,线索是这些行具有不同的时间戳。

Now think about a finer resolution of time series. What if you do not have dates but timestamps. Then you cannot group by the time column anymore. But nevertheless it might be important to analyse the data in the order they were added (maybe the timestamp is the creating time of your data set). Then you realize that some consecutive rows have the same value and you want to group your data by this common value. But the clue is that the rows have different timestamps.

这里的问题是您不能执行 PARTITION BY value_column 。因为 PARTITION BY 会强制先订购。因此,在分组之前,表将由 value_column 进行排序,并且不再由时间戳进行排序。

The problem here is that you cannot do a PARTITION BY value_column. Because PARTITION BY forces an ordering first. So your table would be ordered by the value_column before the grouping and is not ordered by the timestamp anymore. This yields in results you are not expecting.

更笼统地说:问题是要确保特殊排序,即使有序列不属于所创建的列分区

示例:

db<>小提琴

我有下表:

ts      val
100000  50
130100  30050
160100  60050 
190200  100
220200  30100 
250200  30100 
300000  300
500000  100
550000  1000  
600000  1000
650000  2000  
700000  2000
720000  2000
750000  300

I有一个问题,我必须将列 val 的所有绑定值分组。但是我想按 ts 的价格保留订单。为此,我想为每个 val

I had the problem that I had to group all tied values of the column val. But I wanted to hold the order by ts. To achieve this I wanted to add a column with a unique ID per val group

添加一列具有唯一ID的预期结果:

ts      val     group
100000  50      1
130100  30050   2
160100  60050   3
190200  100     4
220200  30100   5     \ same group
250200  30100   5     /
300000  300     6
500000  100     7
550000  1000    8     \ same group
600000  1000    8     /
650000  2000    9     \
700000  2000    9     | same group
720000  2000    9     /
750000  300     10

第一试试是使用 rank 窗口函数,该函数通常可以完成此工作:

First try was the use of the rank window function which would do this job normally:

SELECT 
    *,
    rank() OVER (PARTITION BY val ORDER BY ts)
FROM 
    test

但是在这种情况下这是行不通的,因为 PARTITION BY 子句首先将表排序按其分区列(在这种情况下为 val ),然后按其 ORDER BY 列。因此,订单的价格为 val,ts ,而不是预期的订单为 ts 。因此结果当然不是预期的结果。

But in this case this doesn't work because the PARTITION BY clause orders the table first by its partition columns (val in this case) and then by its ORDER BY columns. So the order is by val, ts instead of the expected order by ts. So the result was not the expected one of course.

ts       val     rank
100000   50      1
190200   100     1
500000   100     2
300000   300     1
750000   300     2
550000   1000    1
600000   1000    2
650000   2000    1
700000   2000    2
720000   2000    3
130100   30050   1
220200   30100   1
250200   30100   2
160100   60050   1

问题是:如何通过 ts 来获取有关订单的组ID strong>?

The question is: How to get the group ids with respect to the order by ts?

编辑:我在下面添加了自己的解决方案,但是我对此感到非常不舒服它。似乎太复杂了。 我想知道是否有更好的方法来达到这个结果。

I added an own solution below but I feel very uncomfortable with it. It seems way too complicated. I was wondering if there's a better way to achieve this result.

推荐答案

我想到了这个我自己的解决方案(希望别人会得到更好的):

I came up with this solution by myself (hoping someone else will get a better one):

demo:db<> fiddle


  1. order通过 ts

  2. 给出下一个 val 值,并带有 lag 窗口函数( https: //www.postgresql.org/docs/current/static/tutorial-window.html

  3. 检查下一个和当前值是否相同。然后我可以打印出 0 1

  4. 这些总和有序 SUM 的值。这将生成我要寻找的组。他们将 val 列分组,但是通过 ts 列确保排序。

  1. order by ts
  2. give out the next val value with the lag window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
  3. check if the next and the current values are the same. Then I can print out a 0 or a 1
  4. sum up these values with an ordered SUM. This generates the groups I am looking for. They group the val column but ensure the ordering by the ts column.

查询:

SELECT 
    *, 
    SUM(is_diff) OVER (ORDER BY ts) 
FROM (
    SELECT 
        *,
        CASE WHEN val = lag(val) over (order by ts) THEN 0 ELSE 1 END as is_diff
    FROM test 
)s

结果:

ts       val     is_diff   sum
100000   50      1         1
130100   30050   1         2
160100   60050   1         3
190200   100     1         4
220200   30100   1         5    \ group
250200   30100   0         5    /
300000   300     1         6
500000   100     1         7
550000   1000    1         8    \ group
600000   1000    0         8    /
650000   2000    1         9    \
700000   2000    0         9    | group
720000   2000    0         9    /
750000   300     1         10

这篇关于窗口函数:在第一列之后按另一列划分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆