Query last N related rows per row

Views: 56 Published: 2018/8/2 13:15:38 Tags: sql performance postgresql indexing query-optimization

Question

I have the following query, which fetches the id of the latest N observations for each station:

SELECT id
FROM (
  SELECT station_id, id, created_at,
         row_number() OVER(PARTITION BY station_id
                           ORDER BY created_at DESC) AS rn
  FROM (
      SELECT station_id, id, created_at
      FROM observations
  ) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, and created_at.

This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81000 records).

How can I speed up the query?

Recommended answer

Assuming the current version, Postgres 9.3.

First, a multicolumn index will help:

CREATE INDEX observations_special_idx
ON observations (station_id, created_at DESC, id);

created_at DESC is a slightly better fit, but without DESC the index would still be scanned backwards at almost the same speed.

This assumes created_at is defined NOT NULL; otherwise consider DESC NULLS LAST in both the index and the query:

PostgreSQL sort by datetime asc, null first?
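If created_at can be NULL, the matching index and query might look roughly like the sketch below. The index name observations_nulls_last_idx and the literal 5 (standing in for #{n}) are assumptions made for illustration only.

```sql
-- Assumption: created_at is nullable. In descending order Postgres sorts
-- NULLs first by default, so NULLS LAST is spelled out in both places.
CREATE INDEX observations_nulls_last_idx
ON observations (station_id, created_at DESC NULLS LAST, id);

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC NULLS LAST) AS rn
  FROM   observations
  ) s
WHERE  rn <= 5   -- 5 stands in for #{n}
ORDER  BY station_id, created_at DESC NULLS LAST;
```

The sort order in the query has to match the one in the index definition, otherwise the planner cannot use the index to deliver rows already sorted.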

The last column, id, is only useful if you get an index-only scan out of this, which probably won't work if you constantly add lots of new rows. In that case, remove id from the index.
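One way to check this on your own data is to look at the query plan. The sketch below assumes an integer station_id, a hypothetical station with id 1, and 5 standing in for #{n}. The second index name is made up for illustration.

```sql
-- Check which scan the planner picks for one station's latest rows.
-- With the three-column index, look for "Index Only Scan" and a low
-- "Heap Fetches" count in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   observations
WHERE  station_id = 1      -- hypothetical station id
ORDER  BY created_at DESC
LIMIT  5;                  -- 5 stands in for #{n}

-- Variant without id, for tables that receive new rows constantly
-- and therefore rarely qualify for index-only scans.
CREATE INDEX observations_station_created_idx
ON observations (station_id, created_at DESC);
```

If the plan keeps reporting many heap fetches, the extra column only makes the index bigger, and the two-column variant is likely the better choice.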

Simplify your query; the inner subselect doesn't help:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM   observations
  ) s
WHERE  rn <= #{n}
ORDER  BY station_id, created_at DESC;

This should be a bit faster, but it is still slow.

Assuming you have relatively few stations and relatively many observations per station.
Also assuming station_id is defined as NOT NULL.

To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres). Related answer:

Optimize GROUP BY query to retrieve latest record per user

If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):

SELECT o.id
FROM   stations s
JOIN   LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.id  -- lateral reference
   ORDER  BY created_at DESC
   LIMIT  #{n}
   ) o ON TRUE
ORDER  BY s.id, o.created_at DESC;
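Note that a plain JOIN LATERAL ... ON TRUE drops stations that have no observations at all. If those stations should still appear in the result, a LEFT JOIN variant along these lines may help. This is a sketch only: the output column aliases are assumptions, and 5 stands in for #{n}.

```sql
SELECT s.id AS station_id, o.id AS observation_id
FROM   stations s
LEFT   JOIN LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.id  -- lateral reference
   ORDER  BY created_at DESC
   LIMIT  5                  -- 5 stands in for #{n}
   ) o ON TRUE               -- stations without rows get NULLs for o.*
ORDER  BY s.id, o.created_at DESC;
```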

If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
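A minimal sketch of that setup, assuming no stations table exists yet and that station_id is an integer (the constraint name is also an assumption):

```sql
-- Assumption: no stations table exists yet and station_id is an integer.
CREATE TABLE stations (
   id integer PRIMARY KEY
);

-- Seed it once from the existing observations.
INSERT INTO stations (id)
SELECT DISTINCT station_id
FROM   observations
WHERE  station_id IS NOT NULL;

-- Enforce that every observation belongs to a known station.
ALTER TABLE observations
   ADD CONSTRAINT observations_station_id_fkey
   FOREIGN KEY (station_id) REFERENCES stations (id);
```

With the foreign key in place, a new station has to be inserted into stations before its first observation, which is what keeps the table maintained.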

If that's not an option, you can distill such a table on the fly. Simple options would be:

SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;

But those would need a sequential scan and be too slow. Trick Postgres into using the above index (or any btree index with station_id as the leading column) with a recursive CTE:

WITH RECURSIVE stations AS (
   (                  -- extra pair of parentheses ...
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )                  -- ... is required!
   UNION ALL
   SELECT (SELECT station_id
           FROM   observations
           WHERE  station_id > s.station_id
           ORDER  BY station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL  -- serves as break condition
   )
SELECT station_id
FROM   stations
WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL

Use that as a drop-in replacement for the stations table in the simple query above:

WITH RECURSIVE stations AS (
   (
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT station_id
           FROM   observations
           WHERE  station_id > s.station_id
           ORDER  BY station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL
   )
SELECT o.id
FROM   stations s
JOIN   LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.station_id
   ORDER  BY created_at DESC
   LIMIT  #{n}
   ) o ON TRUE
WHERE  s.station_id IS NOT NULL
ORDER  BY s.station_id, o.created_at DESC;

This should still be faster than what you had by orders of magnitude.

SQL Fiddle.
