优化分组最大查询 [英] Optimize groupwise maximum query
问题描述
select *
from records
where id in ( select max(id) from records group by option_id )
该查询即使在数百万行上也可以正常工作.但是,正如您从explain语句的结果中看到的那样:
This query works fine even on millions of rows. However as you can see from the result of explain statement:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
-> HashAggregate (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
-> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
-> Seq Scan on records records_1 (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
-> Index Scan using records_pkey on records (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms
(cost=0.00..23995.15 rows=1240315 width=8)
<-此处表示正在扫描所有行,这显然效率低下.
(cost=0.00..23995.15 rows=1240315 width=8)
<- Here it says it is scanning all rows and that is obviously inefficient.
我也尝试过重新排序查询:
I also tried reordering the query:
select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
-> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
-> Seq Scan on records (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
-> Index Scan using records_pkey on records r (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms
(cost=0.00..23995.15 rows=1240315 width=8)
<-仍在扫描所有行.
(cost=0.00..23995.15 rows=1240315 width=8)
<- Still scanning all rows.
我尝试在(option_id)
,(option_id, id)
,(option_id, id desc)
上使用索引和不使用索引,它们都没有对查询计划产生任何影响.
I tried with and without index on (option_id)
, (option_id, id)
, (option_id, id desc)
, none of them had any effect on the query plan.
有没有一种方法可以在Postgres中执行逐组最大查询而不扫描所有行?
Is there a way of executing a groupwise maximum query in Postgres without scanning all rows?
我正在以编程方式寻找的索引是存储每个option_id
插入记录表时的最大id.这样,当我查询option_id的最大值时,我只需要扫描索引记录的次数就可以达到与其他option_ids一样多的次数.
What I am looking for, programmatically, is an index which stores the maximum id for each option_id
as they are inserted into the records table. That way, when I query for maximums of option_ids, I should only need to scan index records as many times as there are different option_ids.
我已经看到select distinct on
回答了来自高级用户的所有问题(感谢@Clodoaldo Neto为我提供了要搜索的关键字).这就是为什么它不起作用的原因:
I've seen select distinct on
answers all over SO from high ranking users (thanks to @Clodoaldo Neto for giving me keywords to search for). Here's why it doesn't work:
create index index_name on records(option_id, id desc)
select distinct on (option_id) *
from records
order by option_id, id desc
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
-> Index Scan using records_option_id_id_idx on records (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms
太好了,它正在使用索引.但是,使用索引扫描所有ID并没有多大意义.根据我的处决,它实际上比简单的顺序扫描要慢.
That's great, it's using an index. However using an index to scan all ids doesn't really make much sense. According to my executions, it is in fact slower than a simple sequential scan.
有趣的是,MySQL 5.5能够仅使用records(option_id, id)
Interesting enough, MySQL 5.5 is able to optimize the query simply using an index on records(option_id, id)
mysql> select count(1) from records;
+----------+
| count(1) |
+----------+
| 1086342 |
+----------+
1 row in set (0.00 sec)
mysql> explain extended select * from records
inner join ( select max(id) max_id from records group by option_id ) mr
on mr.max_id= records.id;
+------+----------+--------------------------+
| rows | filtered | Extra |
+------+----------+--------------------------+
| 1056 | 100.00 | |
| 1 | 100.00 | |
| 201 | 100.00 | Using index for group-by |
+------+----------+--------------------------+
3 rows in set, 1 warning (0.02 sec)
推荐答案
假设options
中的许多行相对 options
中的行相对较少.
Assuming relatively few rows in options
for many rows in records
.
通常,您将有一个从records.option_id
引用的查询表options
,最好是
Typically, you would have a look-up table options
that is referenced from records.option_id
, ideally with a foreign key constraint. If you don't, I suggest to create one to enforce referential integrity:
CREATE TABLE options (
option_id int PRIMARY KEY
, option text UNIQUE NOT NULL
);
INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM records;
然后,我们无需再模拟松散索引扫描,它就变成非常简单快捷.相关的子查询可以在(option_id, id)
上使用普通索引.
Then we have no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id)
.
SELECT option_id
,(SELECT max(id)
FROM records
WHERE option_id = o.option_id
) AS max_id
FROM options o
ORDER BY 1;
这包括表records
中不匹配的选项.对于max_id
,您将获得NULL,并且可以根据需要轻松地在外部SELECT
中删除此类行.
This includes options with no match in table records
. You get NULL for max_id
and you can easily remove such rows in an outer SELECT
if needed.
或(结果相同):
SELECT option_id
, (SELECT id
FROM records
WHERE option_id = o.option_id
ORDER BY id DESC NULLS LAST
) AS max_id
FROM options o
ORDER BY 1;
可能会快一点.子查询使用排序顺序DESC NULLS LAST
-与忽略NULL值的聚合函数max()
相同.仅排序DESC
首先将为NULL:
May be a bit faster. The subquery uses the sort order DESC NULLS LAST
- same as the aggregate function max()
which ignores NULL values. Sorting just DESC
would have NULL first:
因此,完美的索引:
CREATE INDEX on records (option_id, id DESC NULLS LAST);
在定义列NOT NULL
时没什么关系.
Doesn't matter much while columns are defined NOT NULL
.
在小表options
上仍然可以进行顺序扫描,这只是获取所有行的最快方法. ORDER BY
可能会引入索引扫描(仅)以获取预排序的行.
仅通过(位图)索引扫描访问大表records
-或(如果可能)通过 仅索引扫描 .
There can still be a sequential scan on the small table options
, that's just the fastest way to fetch all rows. The ORDER BY
may bring in an index (only) scan to fetch pre-sorted rows.
The big table records
is only accessed via (bitmap) index scan - or, if possible, index-only scan.
SQL小提琴 显示了两个仅索引扫描简单的情况.
SQL Fiddle showing two index-only scans for the simple case.
Or 使用LATERAL
连接在Postgres 9.3+中具有类似的效果:
Or use LATERAL
joins for a similar effect in Postgres 9.3+:
这篇关于优化分组最大查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!