Query last N related rows per row


Problem description

I have the following query, which fetches the id of the latest N observations for each station:

SELECT id
FROM (
  SELECT station_id, id, created_at,
         row_number() OVER(PARTITION BY station_id
                           ORDER BY created_at DESC) AS rn
  FROM (
      SELECT station_id, id, created_at
      FROM observations
  ) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, and created_at.

This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81,000 records).

How can I speed up the query?

Recommended answer

Assuming the current version, Postgres 9.3.

First, a multicolumn index will help:

CREATE INDEX observations_special_idx
ON observations (station_id, created_at DESC, id);

created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.

This assumes created_at is defined NOT NULL; otherwise, consider DESC NULLS LAST in both the index and the query:

  • PostgreSQL sort by datetime asc, null first?
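
As a rough sketch of the nullable case: if created_at can be NULL, the index and the window ORDER BY would both spell out NULLS LAST so that they sort identically. The index name observations_nulls_last_idx is hypothetical, and the literal 3 stands in for the #{n} placeholder used elsewhere in this thread.

```sql
-- Assumption: created_at is nullable. The index name is made up for illustration.
CREATE INDEX observations_nulls_last_idx
ON observations (station_id, created_at DESC NULLS LAST);

-- The sort order in the query has to match the index definition.
SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC NULLS LAST) AS rn
  FROM   observations
  ) s
WHERE  rn <= 3   -- 3 stands in for #{n}
ORDER  BY station_id, created_at DESC NULLS LAST;
```

Without NULLS LAST, a descending sort in Postgres puts NULL values first, so rows with no timestamp would be picked as the "latest" ones.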

The last column, id, is only useful if you get an index-only scan out of this, which probably won't work if you add lots of new rows constantly. In that case, remove id from the index.
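
One way to check whether an index-only scan actually happens is to look at the plan. The following is only a sketch: the station id 1 and the limit 3 are placeholder values, and the plan you get depends on your data and table statistics.

```sql
-- Refresh statistics and the visibility map, which index-only scans rely on.
VACUUM (ANALYZE) observations;

-- Look for "Index Only Scan" in the output and check its "Heap Fetches" count.
-- A high count means many rows still had to be verified against the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT station_id, created_at, id
FROM   observations
WHERE  station_id = 1          -- placeholder station id
ORDER  BY created_at DESC
LIMIT  3;                      -- placeholder for #{n}
```

If the plan shows a plain Index Scan, or Heap Fetches stays high between vacuums, the trailing id column is probably not paying for itself.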

Simplify your query; the inner subselect doesn't help:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM   observations
  ) s
WHERE  rn <= #{n}
ORDER  BY station_id, created_at DESC;

This should be a bit faster, but still slow.


  • Assuming you have relatively few stations and relatively many observations per station.
  • Also assuming station_id is defined as NOT NULL.

To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres). Related answer:

  • Optimize GROUP BY query to retrieve latest record per user

If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):

SELECT o.id
FROM   stations s
JOIN   LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.id  -- lateral reference
   ORDER  BY created_at DESC
   LIMIT  #{n}
   ) o ON TRUE
ORDER  BY s.id, o.created_at DESC;
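
Note that JOIN LATERAL ... ON TRUE drops stations that have no observations at all. If those stations should still appear in the result, a LEFT JOIN variant along these lines may be closer to what you want. This is a sketch, assuming the same stations and observations tables as above, with 3 standing in for #{n}:

```sql
SELECT s.id AS station_id, o.id AS observation_id
FROM   stations s
LEFT   JOIN LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.id  -- lateral reference
   ORDER  BY created_at DESC
   LIMIT  3                  -- stands in for #{n}
   ) o ON TRUE
ORDER  BY s.id, o.created_at DESC;
```

Stations without observations come back as a single row with observation_id set to NULL.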

If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
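
Creating such a table could look roughly like this. It is a sketch under assumptions: station_id is taken to be an integer, the stations table has no other columns yet, and the constraint name is made up.

```sql
-- Assumption: observations.station_id is of type integer.
CREATE TABLE stations (
   id integer PRIMARY KEY
);

-- Seed the table once from the existing data (a one-off sequential scan).
INSERT INTO stations (id)
SELECT DISTINCT station_id
FROM   observations;

-- Hypothetical constraint name. From here on, every observation must
-- reference an existing station.
ALTER TABLE observations
   ADD CONSTRAINT observations_station_id_fkey
   FOREIGN KEY (station_id) REFERENCES stations (id);
```

With the foreign key in place, a new station has to be inserted into stations before its first observation arrives, which is what keeps the table maintained.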

If that's not an option, you can distill such a table on the fly. Simple options would be:

SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;

But those would need a sequential scan and would be too slow. You can trick Postgres into using the above index (or any btree index with station_id as the leading column) with a recursive CTE:

WITH RECURSIVE stations AS (
   (                  -- extra pair of parentheses ...
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )                  -- ... is required!
   UNION ALL
   SELECT (SELECT station_id
           FROM   observations
           WHERE  station_id > s.station_id
           ORDER  BY station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL  -- serves as break condition
   )
SELECT station_id
FROM   stations
WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL

Use that as a drop-in replacement for the stations table in the simple query above:

WITH RECURSIVE stations AS (
   (
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT station_id
           FROM   observations
           WHERE  station_id > s.station_id
           ORDER  BY station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL
   )
SELECT o.id
FROM   stations s
JOIN   LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.station_id
   ORDER  BY created_at DESC
   LIMIT  #{n}
   ) o ON TRUE
WHERE  s.station_id IS NOT NULL
ORDER  BY s.station_id, o.created_at DESC;

This should still be faster than the original query by orders of magnitude.

SQL Fiddle.
