优化分组最大查询 [英] Optimize groupwise maximum query

查看:89
本文介绍了优化分组最大查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

select * 
from records 
where id in ( select max(id) from records group by option_id )

该查询即使在数百万行上也可以正常工作.但是,正如您从explain语句的结果中看到的那样:

This query works fine even on millions of rows. However as you can see from the result of explain statement:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
     ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
           ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8)<-此处表示正在扫描所有行,这显然效率低下.

(cost=0.00..23995.15 rows=1240315 width=8) <- Here it says it is scanning all rows and that is obviously inefficient.

我也尝试过重新排序查询:

I also tried reordering the query:

select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
     ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms

(cost=0.00..23995.15 rows=1240315 width=8)<-仍在扫描所有行.

(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.

我尝试在(option_id)(option_id, id)(option_id, id desc)上使用索引和不使用索引,它们都没有对查询计划产生任何影响.

I tried with and without index on (option_id), (option_id, id), (option_id, id desc), none of them had any effect on the query plan.

有没有一种方法可以在Postgres中执行逐组最大查询而不扫描所有行?

Is there a way of executing a groupwise maximum query in Postgres without scanning all rows?

我正在以编程方式寻找的索引是存储每个option_id插入记录表时的最大id.这样,当我查询option_id的最大值时,我只需要扫描索引记录的次数就可以达到与其他option_ids一样多的次数.

What I am looking for, programmatically, is an index which stores the maximum id for each option_id as they are inserted into the records table. That way, when I query for maximums of option_ids, I should only need to scan index records as many times as there are different option_ids.

我已经看到select distinct on回答了来自高级用户的所有问题(感谢@Clodoaldo Neto为我提供了要搜索的关键字).这就是为什么它不起作用的原因:

I've seen select distinct on answers all over SO from high ranking users (thanks to @Clodoaldo Neto for giving me keywords to search for). Here's why it doesn't work:

create index index_name on records(option_id, id desc)

select distinct on (option_id) *
from records
order by option_id, id desc
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
  ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms

太好了,它正在使用索引.但是,使用索引扫描所有ID并没有多大意义.根据我的处决,它实际上比简单的顺序扫描要慢.

That's great, it's using an index. However using an index to scan all ids doesn't really make much sense. According to my executions, it is in fact slower than a simple sequential scan.

有趣的是,MySQL 5.5能够仅使用records(option_id, id)

Interesting enough, MySQL 5.5 is able to optimize the query simply using an index on records(option_id, id)

mysql> select count(1) from records;

+----------+
| count(1) |
+----------+
|  1086342 |
+----------+

1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
                                                      on mr.max_id= records.id;

+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+

3 rows in set, 1 warning (0.02 sec)

推荐答案

假设options 中的许多行相对 options 中的行相对较少.

Assuming relatively few rows in options for many rows in records.

通常,您将有一个从records.option_id引用的查询options ,最好是

Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest to create one to enforce referential integrity:

CREATE TABLE options (
  option_id int  PRIMARY KEY
, option    text UNIQUE NOT NULL
);

INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM   records;

然后,我们无需再模拟松散索引扫描,它就变成非常简单快捷.相关的子查询可以在(option_id, id)上使用普通索引.

Then we have no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).

SELECT option_id
      ,(SELECT max(id)
        FROM   records
        WHERE  option_id = o.option_id
       ) AS max_id
FROM   options o
ORDER  BY 1;

这包括表records中不匹配的选项.对于max_id,您将获得NULL,并且可以根据需要轻松地在外部SELECT中删除此类行.

This includes options with no match in table records. You get NULL for max_id and you can easily remove such rows in an outer SELECT if needed.

或(结果相同):

SELECT option_id
     , (SELECT id
        FROM   records
        WHERE  option_id = o.option_id
        ORDER  BY id DESC NULLS LAST
       ) AS max_id
FROM   options o
ORDER  BY 1;

可能会快一点.子查询使用排序顺序DESC NULLS LAST-与忽略NULL值的聚合函数max()相同.仅排序DESC首先将为NULL:

May be a bit faster. The subquery uses the sort order DESC NULLS LAST - same as the aggregate function max() which ignores NULL values. Sorting just DESC would have NULL first:

因此,完美的索引:

CREATE INDEX on records (option_id, id DESC NULLS LAST);

在定义列NOT NULL时没什么关系.

Doesn't matter much while columns are defined NOT NULL.

在小表options上仍然可以进行顺序扫描,这只是获取所有行的最快方法. ORDER BY可能会引入索引扫描(仅)以获取预排序的行.
仅通过(位图)索引扫描访问大表records-或(如果可能)通过 仅索引扫描 .

There can still be a sequential scan on the small table options, that's just the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via (bitmap) index scan - or, if possible, index-only scan.

SQL小提琴 显示了两个仅索引扫描简单的情况.

SQL Fiddle showing two index-only scans for the simple case.

Or 使用LATERAL连接在Postgres 9.3+中具有类似的效果:

Or use LATERAL joins for a similar effect in Postgres 9.3+:

这篇关于优化分组最大查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆