优化GROUP BY查询以检索每个用户的最新行 [英] Optimize GROUP BY query to retrieve latest row per user

查看:91
本文介绍了优化GROUP BY查询以检索每个用户的最新行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Postgres 9.2中,我有以下用于用户消息(简化形式)的日志表:

I have the following log table for user messages (simplified form) in Postgres 9.2:

CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);

每个用户每天最多包含一条记录.在300天之内,每天大约有50万条记录.每个用户的有效负载都在增加(如果有关系的话).

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).

我想有效地检索每个用户在特定日期之前的最新记录.我的查询是:

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(log_date), max(payload) 
FROM log 
WHERE log_date <= :mydate 
GROUP BY user_id

,这非常慢.我也尝试过:

which is extremely slow. I have also tried:

SELECT DISTINCT ON(user_id), log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;

具有相同的计划,但速度同样慢.

which has the same plan and is equally slow.

到目前为止,我在log(log_date)上只有一个索引,但没有太大帮助.

So far I have a single index on log(log_date), but doesn't help much.

我有一个users表,其中包含所有用户.我还想为某些用户(使用payload > :value的用户)检索结果.

And I have a users table with all users included. I also want to retrieve the result for some some users (those with payload > :value).

我还有其他索引可以用来加快速度吗?还是可以通过其他任何方式来实现我想要的目标?

Is there any other index I should use to speed this up, or any other way to achieve what I want?

推荐答案

为获得最佳读取性能,您需要多列索引:

For best read performance you need a multicolumn index:

CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);

要使 仅索引扫描 成为可能,请添加否则不需要覆盖索引中的列payload使用INCLUDE子句(Postgres 11或更高版本):

To make index only scans possible, add the otherwise not needed column payload in a covering index with the INCLUDE clause (Postgres 11 or later):

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);

请参阅:

较早版本的Fallback:

Fallback for older versions:

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);

为什么DESC NULLS LAST?

对于每个user_id或小型表DISTINCT ON 行,通常最快,最简单:

For few rows per user_id or small tables DISTINCT ON is typically fastest and simplest:

对于每个user_id 许多 行, 索引跳过扫描(或松散索引扫描)效率更高.直到Postgres 12才实现该功能- Postgres 13的工作正在进行中.但是,有一些方法可以有效地对其进行仿真.

For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 13. But there are ways to emulate it efficiently.

公用表表达式需要Postgres 8.4 + .
LATERAL 需要Postgres 9.3 + .
以下解决方案超出了 Postgres Wiki 所涵盖的范围.

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.

使用单独的users表,下面 2.中的解决方案通常更简单,更快捷.向前跳.

With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.

WITH RECURSIVE cte AS (
   (                                -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;

这很容易检索任意列,并且在当前的Postgres中可能最好.在下面的 2a.章中有更多说明.

This is simple to retrieve arbitrary columns and probably best in current Postgres. More explanation in chapter 2a. below.

WITH RECURSIVE cte AS (
   (                                           -- parentheses required
   SELECT l AS my_row                          -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l                            -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate        -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL       -- note parentheses
   )
SELECT (my_row).*                              -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;

方便地检索单列整行.该示例使用表的整个行类型.其他变体也是可能的.

Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

要断言在上一次迭代中发现一行,请测试单个NOT NULL列(如主键).

To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

第2b章中对此查询的更多说明.

相关:

  • Query last N related rows per row
  • GROUP BY one column, while sorting by another in PostgreSQL

只要保证每个相关user_id仅一行,表布局就无关紧要.示例:

Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

理想情况下,该表在物理上与log表同步排序.参见:

Ideally, the table is physically sorted in sync with the log table. See:

或者它足够小(低基数)几乎没有关系.否则,对查询中的行进行排序可以帮助进一步优化性能. 请参见Gang Liang的补充内容.如果users表的物理排序顺序恰好与log上的索引匹配,这可能无关紧要.

Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.

SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id         -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL 允许引用FROM之前的项目相同的查询级别.参见:

JOIN LATERAL allows to reference preceding FROM items on the same query level. See:

对每个用户进行一次索引(仅)查询.

Results in one index (-only) look-up per user.

对于users表中缺少的用户,不返回任何行.通常,强制引用完整性的外键约束将排除这种情况.

Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.

此外,对于没有log中匹配条目的用户,没有行-符合原始问题.要使这些用户留在结果中,请使用 LEFT JOIN LATERAL ... ON true 而不是CROSS JOIN LATERAL:

Also, no row for users without matching entry in log - conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

使用 LIMIT n 代替LIMIT 1来为每个用户检索多行(但不是全部).

Use LIMIT n instead of LIMIT 1 to retrieve more than one rows (but not all) per user.

有效地,所有这些都做相同的事情:

Effectively, all of these do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

最后一个优先级较低.显式JOIN在逗号前绑定.这种细微的差别可能与更多的联接表有关.参见:

The last one has lower priority, though. Explicit JOIN binds before comma. That subtle difference can matters with more join tables. See:

单行检索单列的好选择.代码示例:

Good choice to retrieve a single column from a single row. Code example:

多列也可以,但是您需要更多的技巧:

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*              -- note parentheses
FROM (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;

  • 像上面的LEFT JOIN LATERAL一样,此变体包括 all 个用户,即使在log中没有条目.您获得combo1NULL,可以根据需要在外部查询中轻松地使用WHERE子句进行过滤.
    Nitpick:在外部查询中,您无法区分子查询未找到行还是所有列值都碰巧为NULL-结果相同.子查询中需要一个NOT NULL列,以避免这种歧义.

    • Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
      Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.

      相关子查询只能返回单个值.您可以将多个列包装为复合类型.但是为了以后进行分解,Postgres需要一种众所周知的复合类型.仅提供列定义列表,才能分解匿名记录.
      使用注册类型,例如现有表的行类型.或使用CREATE TYPE显式(永久)注册复合类型.或创建一个临时表(在会话结束时自动删除)以临时注册其行类型.强制转换语法:(log_date, payload)::combo

      A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
      Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo

      最后,我们不想在同一查询级别上分解combo1.由于查询计划器的弱点,这将为每个列评估一次子查询(在Postgres 12中仍然适用).而是使其成为子查询并在外部查询中分解.

      Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

      相关:

      使用100k日志条目和1k用户演示所有4个查询:
      db<>小提琴此处 -第11页
      旧的 sqlfiddle -9.6页

      Demonstrating all 4 queries with 100k log entries and 1k users:
      db<>fiddle here - pg 11
      Old sqlfiddle - pg 9.6

      这篇关于优化GROUP BY查询以检索每个用户的最新行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆