Optimize GROUP BY query to retrieve latest record per user


Question

I have the following table (simplified form) in Postgres 9.2:

CREATE TABLE user_msg_log (
    aggr_date DATE,
    user_id INTEGER,
    running_total INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. running_total is always increasing for each user.

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(aggr_date), max(running_total) 
FROM user_msg_log 
WHERE aggr_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

SELECT DISTINCT ON (user_id) user_id, aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC;

which has the same plan and is equally slow.

So far I have a single index on user_msg_log(aggr_date), but it doesn't help much. Is there any other index I should use to speed this up, or any other way to achieve what I want?

Recommended answer

For best read performance you need a multicolumn index:

CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST);

To make index-only scans possible, add the otherwise unneeded column running_total:

CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total);
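
To check whether the covering index is actually picked up, something along these lines can be run in a scratch session. It is only a sketch: the synthetic data volumes, the user id 42 and the literal dates are assumptions for illustration. The plan you get depends on table size, statistics and Postgres version, so no particular output is claimed here. Index-only scans also rely on the visibility map, hence the VACUUM.

```sql
-- Scratch copy of the table from the question, filled with synthetic rows.
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO user_msg_log (aggr_date, user_id, running_total)
SELECT DATE '2024-01-01' + d, u, d * 10
FROM   generate_series(1, 1000) u      -- 1000 hypothetical users
CROSS  JOIN generate_series(0, 29) d;  -- 30 days per user

CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total);

-- Refresh statistics and the visibility map; index-only scans need the latter.
VACUUM ANALYZE user_msg_log;

-- Inspect the plan for a single-user lookup of the latest row.
EXPLAIN
SELECT aggr_date, running_total
FROM   user_msg_log
WHERE  user_id = 42
AND    aggr_date <= DATE '2024-01-15'
ORDER  BY aggr_date DESC NULLS LAST
LIMIT  1;
```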

Why DESC NULLS LAST?

  • Unused index in range of dates query

For few rows per user_id or small tables a simple DISTINCT ON is among the fastest and simplest solutions:

  • Select first row in each GROUP BY group?
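
For reference, a minimal sketch of the DISTINCT ON form against the table from the question. The sample rows and the literal date are made up for illustration, and the block assumes a fresh session so the temporary table name is free.

```sql
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO user_msg_log VALUES
  ('2024-01-01', 1, 10)
, ('2024-01-02', 1, 25)
, ('2024-01-03', 1, 40)
, ('2024-01-01', 2,  5)
, ('2024-01-04', 2, 50);

-- One row per user_id: the first row in ORDER BY order, i.e. the latest date.
SELECT DISTINCT ON (user_id)
       user_id, aggr_date, running_total
FROM   user_msg_log
WHERE  aggr_date <= DATE '2024-01-02'
ORDER  BY user_id, aggr_date DESC NULLS LAST;
-- With this sample data: (1, 2024-01-02, 25) and (2, 2024-01-01, 5)
```

The leading ORDER BY expressions have to match the DISTINCT ON expressions; the remaining sort keys decide which row of each group is kept.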

For many rows per user_id a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:

1. No separate table with unique users

The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users table, solutions in 2. below are typically simpler and faster.

1a. Recursive CTE with LATERAL join

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, aggr_date, running_total
   FROM   user_msg_log
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id, u.aggr_date, u.running_total
   FROM   cte c
   ,      LATERAL (
      SELECT user_id, aggr_date, running_total
      FROM   user_msg_log
      WHERE  user_id > c.user_id   -- lateral reference
      AND    aggr_date <= :mydate  -- repeat condition
      ORDER  BY user_id, aggr_date DESC NULLS LAST
      LIMIT  1
      ) u
   )
SELECT user_id, aggr_date, running_total
FROM   cte
ORDER  BY user_id;

This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.

1b. Recursive CTE with correlated subquery

Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

WITH RECURSIVE cte AS (
   (
   SELECT u  -- whole row
   FROM   user_msg_log u
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT u1  -- again, whole row
           FROM   user_msg_log u1
           WHERE  user_id > (c.u).user_id  -- parentheses to access row type
           AND    aggr_date <= :mydate     -- repeat predicate
           ORDER  BY user_id, aggr_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.u).user_id IS NOT NULL        -- any NOT NULL column of the row
   )
SELECT (u).*                               -- finally decompose row
FROM   cte
WHERE  (u).user_id IS NOT NULL             -- any column defined NOT NULL
ORDER  BY (u).user_id;

It could be misleading to test the row value with c.u IS NOT NULL. This only returns true if every single column of the tested row is NOT NULL and would fail if a single NULL value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL (like the primary key). More:

  • NOT NULL constraint over a set of columns
  • IS NOT NULL test for a record does not return TRUE when variable is set
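
The row-level NULL semantics described above can be seen in a small standalone query. The column aliases are made up for illustration.

```sql
-- IS NOT NULL on a row is true only if every field is non-null;
-- IS NULL on a row is true only if every field is null.
SELECT ROW(1, NULL::int)         IS NOT NULL AS partly_null_is_not_null  -- false
     , ROW(1, NULL::int)         IS NULL     AS partly_null_is_null      -- false
     , ROW(1, 2)                 IS NOT NULL AS all_set_is_not_null      -- true
     , ROW(NULL::int, NULL::int) IS NULL     AS all_null_is_null;        -- true
```

A row with a mix of NULL and non-null fields is therefore neither IS NULL nor IS NOT NULL, which is why the recursive query tests a single NOT NULL column instead.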

More explanation for this query in chapter 2b. below.
Related answers:

  • Query last N related rows per row
  • GROUP BY one column, while sorting by another in PostgreSQL

2. With a separate users table

Table layout hardly matters as long as we have exactly one row per relevant user_id. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted. See:

  • Optimize Postgres timestamp query range

Or it's small enough (low cardinality) that it hardly matters.
Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.

2a. LATERAL join

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    aggr_date <= :mydate
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL allows you to reference preceding FROM items on the same query level. You get one index(-only) look-up per user. See:

  • What is the difference between LATERAL and a subquery in PostgreSQL?

Consider the possible improvement by sorting the users table suggested by Gang Liang in another answer. If the physical sort order of the users table happens to match the index on user_msg_log, you don't need this.

You don't get results for users missing in the users table, even if you have entries in user_msg_log. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.

You also don't get a row for any user that has no matching entry in user_msg_log. That conforms to your original question. If you need to include those rows in the result, use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

  • Call a set-returning function with an array argument multiple times
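
A sketch of the LEFT JOIN LATERAL ... ON true variant, using the table layouts shown in this answer. The user names, sample rows and literal date are assumptions for illustration; run it in a fresh session so the temporary table names are free.

```sql
CREATE TEMP TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO users (user_id, username) VALUES
  (1, 'alice'), (2, 'bob'), (3, 'carol');   -- hypothetical users
INSERT INTO user_msg_log VALUES
  ('2024-01-01', 1, 10)
, ('2024-01-02', 1, 25)
, ('2024-01-04', 2, 50);

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
LEFT   JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id          -- lateral reference
   AND    aggr_date <= DATE '2024-01-02'
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l ON true                         -- keeps users without a match
ORDER  BY u.user_id;
-- user 1 gets (2024-01-02, 25); users 2 and 3 get NULL, NULL
```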

This form is also the best way to retrieve more than one row (but not all) per user. Just use LIMIT n instead of LIMIT 1.
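
As an illustration of the LIMIT n variant, the following sketch fetches up to two of the latest rows per user. The sample data, the value n = 2 and the literal date are assumptions; the block expects a fresh session.

```sql
CREATE TEMP TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO users (user_id, username) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO user_msg_log VALUES
  ('2024-01-01', 1, 10)
, ('2024-01-02', 1, 25)
, ('2024-01-03', 1, 40)
, ('2024-01-01', 2,  5);

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id
   AND    aggr_date <= DATE '2024-01-03'
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  2                            -- n latest rows per user
   ) l
ORDER  BY u.user_id, l.aggr_date DESC;
-- user 1: (2024-01-03, 40), (2024-01-02, 25); user 2: (2024-01-01, 5)
```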

Effectively, all of these would do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The last form has lower precedence, though: an explicit JOIN binds more tightly than a comma in the FROM list.

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:

  • Optimize groupwise maximum query
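
A minimal sketch of the single-column case, assuming the users and user_msg_log layouts from this answer. The sample rows and the literal date are made up for illustration; run it in a fresh session.

```sql
CREATE TEMP TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO users (user_id, username) VALUES
  (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO user_msg_log VALUES
  ('2024-01-01', 1, 10)
, ('2024-01-02', 1, 25)
, ('2024-01-01', 2,  5)
, ('2024-01-04', 2, 50);

SELECT u.user_id
     , (SELECT l.running_total           -- exactly one column, at most one row
        FROM   user_msg_log l
        WHERE  l.user_id = u.user_id
        AND    l.aggr_date <= DATE '2024-01-02'
        ORDER  BY l.aggr_date DESC NULLS LAST
        LIMIT  1) AS running_total
FROM   users u
ORDER  BY u.user_id;
-- user 1: 25, user 2: 5, user 3: NULL (no log entries)
```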

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (aggr_date date, running_total int);

SELECT user_id, (my_combo).*  -- note the parentheses
FROM (
   SELECT u.user_id
        , (SELECT (aggr_date, running_total)::combo
           FROM   user_msg_log
           WHERE  user_id = u.user_id
           AND    aggr_date <= :mydate
           ORDER  BY aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub;




  • Like LEFT JOIN LATERAL above, this variant includes all users, even those without entries in user_msg_log. You get NULL for my_combo, which you can easily filter with a WHERE clause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL; the result is the same. You would have to include a NOT NULL column in the subquery to be sure.

  • A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed by providing a column definition list.

  • Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with CREATE TYPE, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type: (aggr_date, running_total)::combo

  • Finally, we do not want to decompose combo on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6; improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.

Related:

  • Get values from first and last row per group
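
As a sketch of the CREATE TYPE alternative mentioned in the list above, the query can cast to a permanently registered composite type instead of a temporary table's row type. The type name msg_combo, the sample rows and the literal date are hypothetical; since CREATE TYPE is permanent, try this in a scratch database.

```sql
-- Permanent composite type (hypothetical name) in place of the temp table "combo".
CREATE TYPE msg_combo AS (aggr_date date, running_total int);

CREATE TEMP TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
CREATE TEMP TABLE user_msg_log (
    aggr_date     date,
    user_id       integer,
    running_total integer
);

INSERT INTO users (user_id, username) VALUES
  (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO user_msg_log VALUES
  ('2024-01-01', 1, 10)
, ('2024-01-02', 1, 25)
, ('2024-01-01', 2,  5);

SELECT user_id, (my_combo).*             -- decompose in the outer query
FROM (
   SELECT u.user_id
        , (SELECT (l.aggr_date, l.running_total)::msg_combo
           FROM   user_msg_log l
           WHERE  l.user_id = u.user_id
           AND    l.aggr_date <= DATE '2024-01-02'
           ORDER  BY l.aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub
ORDER  BY user_id;
-- user 1: (2024-01-02, 25); user 2: (2024-01-01, 5); user 3: NULL, NULL

-- DROP TYPE msg_combo;  -- clean up when the type is no longer needed
```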

Demonstrating all 4 queries with 100k log entries and 1k users:
SQL Fiddle - pg 9.6
db<>fiddle here - pg 10
