PostgreSQL窗口功能:通过比较进行分区 [英] PostgreSQL window function: partition by comparison

查看:83
本文介绍了PostgreSQL窗口功能:通过比较进行分区的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在PostgreSQL查询的WINDOW函数中找到与PARTITION BY子句中的当前行进行比较的方法。

I'm trying to find the way of doing a comparison with the current row in the PARTITION BY clause in a WINDOW function in PostgreSQL query.

想象一下我有以下查询的这5个元素的简短列表(实际上,我有成千上万的行)。我试图为每一行获取下一个不同元素(事件列)的ID,以及上一个不同元素的ID。

Imagine I have the short list in the following query of this 5 elements (in the real case, I have thousands or even millions of rows). I am trying to get for each row, the id of the next different element (event column), and the id of the previous different element.

WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id)  over w as previous_different, event
     , lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);

我知道比较 event!= ev.event 不正确,但这就是我要达到的目的。

I know the comparison event!=ev.event is incorrect but that's the point I want to reach.

我得到的结果是(与删除PARTITION BY子句相同):

The result I get is (the same as if I delete the PARTITION BY clause):

 |12|2
1|12|3
2|13|4
3|13|5
4|12|

我想得到的结果是:

 |12|3
 |12|3
2|13|5
2|13|5
4|12|

任何人都知道这是否可行以及如何实现?非常感谢!

Anyone knows if it is possible and how? Thank you very much!

编辑:我知道我可以使用两个 JOIN s, ORDER BY DISTINCT ON ,但在数百万行的实际情况下,效率非常低:

I know I can do it with two JOINs, a ORDER BY and a DISTINCT ON, but in the real case of millions of rows it is very inefficient:

WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event) 
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event) 
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC


推荐答案

使用几个不同的窗口函数和两个子查询,这应该体面地快速工作:

Using several different window functions and two subqueries, this should work decently fast:

WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;

使用 ts 作为时间戳列的名称。

假定 ts 是唯一的-并且 已编入索引 唯一约束会自动执行此操作)。

Using ts as name for the timestamp column.
Assuming ts to be unique - and indexed (a unique constraint does that automatically).

在具有5万行的真实表的测试中,只需要单次索引扫描。因此,即使有大桌子也应该相当快。相比之下,具有join / distinct的查询在一分钟后(如预期)没有完成。

即使是优化版本,一次也处理一次交叉联接(左联接几乎没有限制条件是

In a test with a real life table with 50k rows it only needed a single index scan. So, should be decently fast even with big tables. In comparison, your query with join / distinct did not finish after a minute (as expected).
Even an optimized version, dealing with one cross join at a time (the left join with hardly a limiting condition is effectively a limited cross join) did not finish after a minute.

要获得大表的最佳性能,请调整内存设置,尤其是 work_mem (用于大型排序操作)。如果可以节省RAM,请考虑暂时为会话设置更高的值。在此处此处

For best performance with a big table, tune your memory settings, in particular for work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM. Read more here and here.


  1. 在子查询 sub1 中,仅查看上一行的事件保留该值(如果已更改),从而标记新组的第一个元素。同时,获取上一行和下一行的 id pre_id post_id )。

  1. In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).

在子查询 sub2 中, count()仅计算非空值。产生的 grp 在连续的相同事件块中标记同伴。

In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.

最后一个 SELECT ,每组每行取第一个 pre_id 和最后一个 post_id 以获得所需的结果。

实际上,在外部 SELECT 中应该更快:

In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
Actually, this should be even faster in the outer SELECT:

 last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                           RANGE BETWEEN UNBOUNDED PRECEDING
                                 AND     UNBOUNDED FOLLOWING) AS post_id

...因为窗口的排序顺序与 pre_id ,因此只需要一种排序。快速测试似乎可以确认这一点。 有关此框架定义的更多信息。

... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.

SQL小提琴。

这篇关于PostgreSQL窗口功能:通过比较进行分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆