窗口函数:在第一列之后按另一列划分 [英] Window functions: PARTITION BY one column after ORDER BY another
问题描述
免责声明:所示问题比我最初预期的要普遍得多。下面的示例来自一个解决方案,另一个问题。但是,现在我拿这个样本来解决更多问题-主要与时间序列有关(请看一下右侧栏中的链接部分)。
Disclaimer: The shown problem is much more general than I expected first. The example below is taken from a solution to another question. But now I was taking this sample for solving many problems more - mostly related to time series (have a look at the "Linked" section in the right bar).
所以我先尝试更笼统地解释这个问题:
So I am trying to explain the problem more generally first:
我正在使用PostgreSQL,但是我确信这个问题在支持DBMS的其他窗口功能(MS SQL Server,Oracle等)中也存在。
窗口函数可用于将某些值组合在一起通过共同的属性或值。例如,您可以按日期对行进行分组。然后,您可以计算每个日期内的最大值,平均值或计数行数之类的值。
Window functions can be used to group certain values together by a common attribute or value. For example you can group rows by a date. Then you are able to calculate the max value within every single date or an average value or counting rows or whatever.
这可以通过定义 PARTITION
。按日期分组将与 PARTITION BY date_column
一起使用。现在,您想执行一个需要在组内进行特殊排序的操作(计算行号或汇总一列)。可以使用 PARTITON BY date_column ORDER BY an_attribute_column
来完成。
This can be achieved by defining a PARTITION
. Grouping by dates would work with PARTITION BY date_column
. Now you want to do an operation which needs a special order within your groups (calculating row numbers or sum up a column). This can be done with PARTITON BY date_column ORDER BY an_attribute_column
.
现在考虑更好地解决时间序列问题。如果没有日期但是有时间戳怎么办。然后,您将无法再按时间列进行分组。但是,尽管如此,按添加顺序分析数据可能很重要(也许时间戳就是数据集的创建时间)。然后,您意识到一些连续的行具有相同的值,并且您希望通过该公用值对数据进行分组。但是,线索是这些行具有不同的时间戳。
Now think about a finer resolution of time series. What if you do not have dates but timestamps. Then you cannot group by the time column anymore. But nevertheless it might be important to analyse the data in the order they were added (maybe the timestamp is the creating time of your data set). Then you realize that some consecutive rows have the same value and you want to group your data by this common value. But the clue is that the rows have different timestamps.
这里的问题是您不能执行 PARTITION BY value_column
。因为 PARTITION BY
会强制先订购。因此,在分组之前,表将由 value_column
进行排序,并且不再由时间戳进行排序。
The problem here is that you cannot do a PARTITION BY value_column
. Because PARTITION BY
forces an ordering first. So your table would be ordered by the value_column
before the grouping and is not ordered by the timestamp anymore. This yields in results you are not expecting.
更笼统地说:问题是要确保特殊排序,即使有序列不属于所创建的列分区。
示例:
我有下表:
ts val
100000 50
130100 30050
160100 60050
190200 100
220200 30100
250200 30100
300000 300
500000 100
550000 1000
600000 1000
650000 2000
700000 2000
720000 2000
750000 300
I有一个问题,我必须将列 val
的所有绑定值分组。但是我想按 ts
的价格保留订单。为此,我想为每个 val
组
I had the problem that I had to group all tied values of the column val
. But I wanted to hold the order by ts
. To achieve this I wanted to add a column with a unique ID per val
group
添加一列具有唯一ID的预期结果:
ts val group
100000 50 1
130100 30050 2
160100 60050 3
190200 100 4
220200 30100 5 \ same group
250200 30100 5 /
300000 300 6
500000 100 7
550000 1000 8 \ same group
600000 1000 8 /
650000 2000 9 \
700000 2000 9 | same group
720000 2000 9 /
750000 300 10
第一试试是使用 rank
窗口函数,该函数通常可以完成此工作:
First try was the use of the rank
window function which would do this job normally:
SELECT
*,
rank() OVER (PARTITION BY val ORDER BY ts)
FROM
test
但是在这种情况下这是行不通的,因为 PARTITION BY
子句首先将表排序按其分区列(在这种情况下为 val
),然后按其 ORDER BY
列。因此,订单的价格为 val,ts
,而不是预期的订单为 ts
。因此结果当然不是预期的结果。
But in this case this doesn't work because the PARTITION BY
clause orders the table first by its partition columns (val
in this case) and then by its ORDER BY
columns. So the order is by val, ts
instead of the expected order by ts
. So the result was not the expected one of course.
ts val rank
100000 50 1
190200 100 1
500000 100 2
300000 300 1
750000 300 2
550000 1000 1
600000 1000 2
650000 2000 1
700000 2000 2
720000 2000 3
130100 30050 1
220200 30100 1
250200 30100 2
160100 60050 1
问题是:如何通过 ts
来获取有关订单的组ID strong>?
The question is: How to get the group ids with respect to the order by ts
?
编辑:我在下面添加了自己的解决方案,但是我对此感到非常不舒服它。似乎太复杂了。 我想知道是否有更好的方法来达到这个结果。
I added an own solution below but I feel very uncomfortable with it. It seems way too complicated. I was wondering if there's a better way to achieve this result.
推荐答案
我想到了这个我自己的解决方案(希望别人会得到更好的):
I came up with this solution by myself (hoping someone else will get a better one):
- order通过
ts
- 给出下一个
val
值,并带有lag
窗口函数( https: //www.postgresql.org/docs/current/static/tutorial-window.html ) - 检查下一个和当前值是否相同。然后我可以打印出
0
或1
- 这些总和有序
SUM
的值。这将生成我要寻找的组。他们将val
列分组,但是通过ts
列确保排序。
- order by
ts
- give out the next
val
value with thelag
window function (https://www.postgresql.org/docs/current/static/tutorial-window.html) - check if the next and the current values are the same. Then I can print out a
0
or a1
- sum up these values with an ordered
SUM
. This generates the groups I am looking for. They group theval
column but ensure the ordering by thets
column.
查询:
SELECT
*,
SUM(is_diff) OVER (ORDER BY ts)
FROM (
SELECT
*,
CASE WHEN val = lag(val) over (order by ts) THEN 0 ELSE 1 END as is_diff
FROM test
)s
结果:
ts val is_diff sum
100000 50 1 1
130100 30050 1 2
160100 60050 1 3
190200 100 1 4
220200 30100 1 5 \ group
250200 30100 0 5 /
300000 300 1 6
500000 100 1 7
550000 1000 1 8 \ group
600000 1000 0 8 /
650000 2000 1 9 \
700000 2000 0 9 | group
720000 2000 0 9 /
750000 300 1 10
这篇关于窗口函数:在第一列之后按另一列划分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!