每行查询最后 N 个相关行 [英] Query last N related rows per row

查看：26 发布时间：2021/12/21 23:48:33 sql performance postgresql indexing query-optimization

本文介绍了每行查询最后 N 个相关行的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有以下查询，它获取每个 station 的最新 N observations 的 id:

SELECT id从 (SELECT station_id, id, created_at,row_number() OVER(PARTITION BY station_idORDER BY created_at DESC) AS rn从 (SELECT station_id, id, created_at来自观察) s) s哪里 rn <= #{n}ORDER BY station_id, created_at DESC;

我在 id、station_id、created_at 上有索引.

这是我想出的唯一解决方案，可以为每个站获取多于一条记录.但是它很慢(81000 条记录的表为 154.0 毫秒).

如何加快查询速度?

解决方案

索引

首先，多列索引会有所帮助:

创建索引观察_special_idxON 观察(station_id, created_at DESC, id)

created_at DESC 稍微好一些，但如果没有 DESC，索引仍然会以几乎相同的速度向后扫描.

假设created_at被定义为NOT NULL，否则在索引和查询中考虑DESC NULLS LAST:

按列 ASC 排序，但首先是 NULL 值?

最后一列 id 仅在您获得仅索引扫描，如果您不断添加大量新行，这可能不起作用.在这种情况下，从索引中删除 id.

更简单的查询(仍然很慢)

简化您的查询，内部子选择没有帮助:

SELECT id从  (SELECT station_id, id, created_at, row_number() OVER (PARTITION BY station_idORDER BY created_at DESC) AS rn来自观察) sWHERE rn <= #{n} -- 你的极限在这里ORDER BY station_id, created_at DESC;

应该快一点，但仍然很慢.

快速查询

假设您有相对少数个站点，并且每个站点有相对很多个观测站.
还假设 station_id id 定义为 NOT NULL.

要真正快速，您需要相当于松散索引扫描(尚未在 Postgres 中实现).相关答案:

优化 GROUP BY 查询以检索每个用户的最新行

如果您有一个单独的 stations 表(这似乎很可能)，您可以使用 JOIN LATERAL (Postgres 9.3+):

SELECT o.idFROM 站横向交叉连接(选择 o.idFROM 观察 oWHERE o.station_id = s.station_id -- 横向参考ORDER BY o.created_at DESCLIMIT #{n} -- 你的极限在这里) oORDER BY s.station_id, o.created_at DESC;

如果您没有站表，那么接下来最好的事情就是创建和维护一个.可能添加外键引用以强制关系完整性.

如果这不是一个选项，您可以即时提取这样的表.简单的选项是:

SELECT DISTINCT station_id FROM Observations;SELECT station_id FROM 观察 GROUP BY 1;

但两者都需要顺序扫描并且速度很慢.使 Postgres 使用上述索引(或任何以 station_id 作为前导列的 btree 索引)和 递归 CTE:

WITH RECURSIVE Station AS (( -- 额外的一对括号...选择 station_id来自观察按 station_id 排序限制 1)                   -  ... 是必须的！联合所有SELECT (SELECT o.station_idFROM 观察 oWHERE o.station_id >s.station_idORDER BY o.station_id限制 1)FROM 站WHERE s.station_id IS NOT NULL -- 作为中断条件)选择 station_id从车站WHERE station_id 不为空；-- 删除带有 NULL 的悬空行

将其用作上述简单查询中 stations 表的插入式替换:

WITH RECURSIVE Station AS ((选择 station_id来自观察按 station_id 排序限制 1)联合所有SELECT (SELECT o.station_idFROM 观察 oWHERE o.station_id >s.station_idORDER BY o.station_id限制 1)FROM 站WHERE s.station_id 不为空)选择 o.idFROM 站横向交叉连接(选择 o.id，o.created_atFROM 观察 oWHERE o.station_id = s.station_idORDER BY o.created_at DESCLIMIT #{n} -- 你的极限在这里) oWHERE s.station_id 不为空ORDER BY s.station_id, o.created_at DESC;

这应该仍然比您的速度快 数量级.

db<>fiddle 这里
_旧sqlfiddle

I have the following query which fetches the id of the latest N observations for each station:

SELECT id
FROM (
  SELECT station_id, id, created_at,
         row_number() OVER(PARTITION BY station_id
                           ORDER BY created_at DESC) AS rn
  FROM (
      SELECT station_id, id, created_at
      FROM observations
  ) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, created_at.

This is the only solution I have come up with that can fetch more than a single record per station. However it is quite slow (154.0 ms for a table of 81000 records).

How can I speed up the query?

解决方案

Index

First, a multicolumn index will help:

CREATE INDEX observations_special_idx
ON observations(station_id, created_at DESC, id)

created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.

Assuming created_at is defined NOT NULL, else consider DESC NULLS LAST in index and query:

Sort by column ASC, but NULL values first?

The last column id is only useful if you get an index-only scan out of it, which probably won't work if you add lots of new rows constantly. In this case, remove id from the index.

Simpler query (still slow)

Simplify your query, the inner subselect doesn't help:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM   observations
  ) s
WHERE  rn <= #{n}  -- your limit here
ORDER  BY station_id, created_at DESC;

Should be a bit faster, but still slow.

Fast query

Assuming you have relatively few stations and relatively many observations per station.
Also assuming station_id id defined as NOT NULL.

To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres, yet). Related answer:

Optimize GROUP BY query to retrieve latest row per user

If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):

SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id
   FROM   observations o
   WHERE  o.station_id = s.station_id  -- lateral reference
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
ORDER  BY s.station_id, o.created_at DESC;

If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.

If that's not an option, you can distill such a table on the fly. Simple options would be:

SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;

But either would need a sequential scan and be slow. Make Postgres use above index (or any btree index with station_id as leading column) with a recursive CTE:

WITH RECURSIVE stations AS (
   (                  -- extra pair of parentheses ...
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )                  -- ... is required!
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL  -- serves as break condition
   )
SELECT station_id
FROM   stations
WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL

Use that as drop-in replacement for the stations table in the above simple query:

WITH RECURSIVE stations AS (
   (
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL
   )
SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id, o.created_at
   FROM   observations o
   WHERE  o.station_id = s.station_id
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
WHERE  s.station_id IS NOT NULL
ORDER  BY s.station_id, o.created_at DESC;

This should still be faster than what you had by orders of magnitude.

db<>fiddle here
_{Old sqlfiddle}

这篇关于每行查询最后 N 个相关行的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

每行查询最后 N 个相关行 [英] Query last N related rows per row

问题描述

索引

更简单的查询(仍然很慢)

快速查询

Index

Simpler query (still slow)

Fast query

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

每行查询最后 N 个相关行 [英] Query last N related rows per row

问题描述

索引

更简单的查询(仍然很慢)

快速查询

Index

Simpler query (still slow)

Fast query

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭