Optimize GROUP BY query to retrieve latest row per user


Problem Description


I have the following log table for user messages (simplified form) in Postgres 9.2:

CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(log_date), max(payload) 
FROM log 
WHERE log_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

SELECT DISTINCT ON (user_id) user_id, log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;

which has the same plan and is equally slow.

So far I have a single index on log(log_date), but it doesn't help much.

And I have a users table with all users included. I also want to retrieve the result for some users only (those with payload > :value).

Is there any other index I should use to speed this up, or any other way to achieve what I want?

Solution

For best read performance you need a multicolumn index:

CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);

To make index-only scans possible, add the otherwise unneeded column payload to a covering index with the INCLUDE clause (Postgres 11 or later):

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);

See:

Fallback for older versions:

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);

Why DESC NULLS LAST? Because the index should sort the same way as the queries below order their rows: in Postgres, an index column declared DESC places NULL values first by default, while the queries use ORDER BY ... DESC NULLS LAST, so declaring the index DESC NULLS LAST keeps the two in line.

For few rows per user_id or small tables DISTINCT ON is typically fastest and simplest:
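For reference, a minimal sketch of that approach on this table (essentially the query from the question with the syntax straightened out):

SELECT DISTINCT ON (user_id)
       user_id, log_date, payload
FROM   log
WHERE  log_date <= :mydate
ORDER  BY user_id, log_date DESC NULLS LAST;  -- latest row per user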

For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 14. But there are ways to emulate it efficiently.

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.

1. No separate table with unique users

With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.

1a. Recursive CTE with LATERAL join

WITH RECURSIVE cte AS (
   (                                -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;

This makes it simple to retrieve arbitrary columns and is probably the best option in current Postgres. More explanation in chapter 2a. below.

1b. Recursive CTE with correlated subquery

WITH RECURSIVE cte AS (
   (                                           -- parentheses required
   SELECT l AS my_row                          -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l                            -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate        -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL       -- note parentheses
   )
SELECT (my_row).*                              -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;

Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

More explanation for this query in chapter 2b. below.

Related:

2. With separate users table

Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted in sync with the log table. See:

Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.

2a. LATERAL join

SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id         -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL makes it possible to reference preceding FROM items on the same query level. See:

Results in one index (-only) look-up per user.

Returns no rows for user_ids that appear in log but are missing from the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out anyway.

Also, no row for users without matching entry in log - conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:
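For example, a minimal sketch changing only the join of the 2a. query, so users without a matching log row are kept (with NULL for the log columns):

SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;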

Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.

Effectively, all of these do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The last one has lower priority, though. Explicit JOIN binds before the comma. That subtle difference can matter with more joined tables. See:

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:
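A minimal sketch along those lines, retrieving only payload (the column alias latest_payload is just for illustration):

SELECT u.user_id
     , (SELECT l.payload
        FROM   log l
        WHERE  l.user_id = u.user_id
        AND    l.log_date <= :mydate
        ORDER  BY l.log_date DESC NULLS LAST
        LIMIT  1) AS latest_payload  -- NULL if the user has no matching log row
FROM   users u;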

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*              -- note parentheses
FROM (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;

Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.
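To illustrate, a sketch of such an outer filter, assuming (per the nitpick) that log_date is never NULL in a found row:

SELECT user_id, (combo1).*                -- decompose in the outer query
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub
WHERE  (combo1).log_date IS NOT NULL;     -- drops users without a matching log entry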

A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed by providing a column definition list.
Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
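For instance, the permanent variant might look like this (a sketch; it registers the same field list as the temp table above, so use one or the other):

CREATE TYPE combo AS (log_date date, payload int);  -- permanent alternative to CREATE TEMP TABLE combo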

Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

Related:

Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here - pg 11
Old sqlfiddle
