Query last N related rows per row
Problem description
I have the following query, which fetches the id of the latest N observations for each station:
SELECT id
FROM (
SELECT station_id, id, created_at,
row_number() OVER(PARTITION BY station_id
ORDER BY created_at DESC) AS rn
FROM (
SELECT station_id, id, created_at
FROM observations
) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;
I have indexes on id, station_id, and created_at.
This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81000 records).
How can I speed up the query?
Recommended answer
Assuming the current version, Postgres 9.3.
First, a multicolumn index will help:
CREATE INDEX observations_special_idx
ON observations (station_id, created_at DESC, id);
created_at DESC is a slightly better fit, but without DESC the index would still be scanned backwards at almost the same speed.
This assumes created_at is defined NOT NULL. Otherwise, consider DESC NULLS LAST in both the index and the query:
- PostgreSQL sort by datetime asc, null first?
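If created_at can in fact be NULL, the index and the query could be adjusted as in the following sketch. The index name is made up for illustration, and the literal 5 stands in for the #{n} placeholder; the point is only that the sort order in the query matches the sort order of the index, so that rows with a NULL timestamp sort last.

```sql
-- Sketch, assuming created_at is nullable. The index name is hypothetical.
CREATE INDEX observations_station_created_nulls_last_idx
    ON observations (station_id, created_at DESC NULLS LAST, id);

-- Use the same sort order in the window and in the final ORDER BY.
SELECT id
FROM  (
   SELECT station_id, id, created_at
        , row_number() OVER (PARTITION BY station_id
                             ORDER BY created_at DESC NULLS LAST) AS rn
   FROM   observations
   ) s
WHERE  rn <= 5   -- 5 stands in for #{n}
ORDER  BY station_id, created_at DESC NULLS LAST;
```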
The last column id is only useful if you get an index-only scan out of this, which probably won't work if you constantly add lots of new rows. In that case, remove id from the index.
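One way to decide is to look at the actual plan. The sketch below assumes station_id is an integer and that a station with id 1 exists; both are placeholders, as is the index name. With the three-column index in place, an "Index Only Scan" node with a low "Heap Fetches" count suggests the id column is paying off. If the plan shows many heap fetches instead, the slimmer two-column index is probably the better choice.

```sql
-- Inspect the plan for one station (station_id = 1 and LIMIT 5 are placeholders).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   observations
WHERE  station_id = 1
ORDER  BY created_at DESC
LIMIT  5;

-- Variant of the index without the trailing id column (hypothetical name).
CREATE INDEX observations_station_created_idx
    ON observations (station_id, created_at DESC);
```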
Simplify your query; the inner subselect doesn't help:
SELECT id
FROM (
SELECT station_id, id, created_at
, row_number() OVER (PARTITION BY station_id
ORDER BY created_at DESC) AS rn
FROM observations
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;
This should be a bit faster, but still slow.
- Assuming you have relatively few stations and relatively many observations per station.
- Also assuming station_id is defined as NOT NULL.
To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres). Related answer:
- Optimize GROUP BY query to retrieve latest record per user
If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):
SELECT o.id
FROM stations s
JOIN LATERAL (
SELECT id, created_at
FROM observations
WHERE station_id = s.id -- lateral reference
ORDER BY created_at DESC
LIMIT #{n}
) o ON TRUE
ORDER BY s.id, o.created_at DESC;
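Note that the inner join above drops stations that have no observations at all. If those stations should still appear in the result, a LEFT JOIN LATERAL variant is one option. This is a sketch under the same assumptions as the query above (a stations table with an id column), with 5 standing in for #{n} and the output column aliases chosen for illustration:

```sql
-- Keeps every station; stations without observations get one row with NULLs.
SELECT s.id AS station_id, o.id AS observation_id
FROM   stations s
LEFT   JOIN LATERAL (
   SELECT id, created_at
   FROM   observations
   WHERE  station_id = s.id      -- lateral reference
   ORDER  BY created_at DESC
   LIMIT  5                      -- 5 stands in for #{n}
   ) o ON TRUE
ORDER  BY s.id, o.created_at DESC;
```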
If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
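Creating such a table could look roughly like this. It is a sketch only: the integer type for the key and the constraint name are assumptions, and a real stations table would likely carry more columns.

```sql
-- Assumption: station_id is an integer; adjust the type to match observations.
CREATE TABLE stations (
   id integer PRIMARY KEY
);

-- Seed the table once from the existing data.
INSERT INTO stations (id)
SELECT DISTINCT station_id
FROM   observations;

-- Enforce that every observation refers to a known station.
ALTER TABLE observations
   ADD CONSTRAINT observations_station_id_fkey
   FOREIGN KEY (station_id) REFERENCES stations (id);
```

With the foreign key in place, a new station has to be inserted into stations before its first observation, which is what keeps the table complete over time.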
If that's not an option, you can distill such a table on the fly. Simple options would be:
SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;
But those would need a sequential scan and be too slow. Trick Postgres into using the above index (or any btree index with station_id as the leading column) with a recursive CTE:
WITH RECURSIVE stations AS (
( -- extra pair of parentheses ...
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
) -- ... is required!
UNION ALL
SELECT (SELECT station_id
FROM observations
WHERE station_id > s.station_id
ORDER BY station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL -- serves as break condition
)
SELECT station_id
FROM stations
WHERE station_id IS NOT NULL; -- remove dangling row with NULL
Use that as a drop-in replacement for the stations table in the above simple query:
WITH RECURSIVE stations AS (
(
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
)
UNION ALL
SELECT (SELECT station_id
FROM observations
WHERE station_id > s.station_id
ORDER BY station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL
)
SELECT o.id
FROM stations s
JOIN LATERAL (
SELECT id, created_at
FROM observations
WHERE station_id = s.station_id
ORDER BY created_at DESC
LIMIT #{n}
) o ON TRUE
WHERE s.station_id IS NOT NULL
ORDER BY s.station_id, o.created_at DESC;
This should still be faster than the original query by orders of magnitude.