为什么Postgres不在简单的GROUP BY上使用索引? [英] Why is Postgres not using index on a simple GROUP BY?
问题描述
我创建了一个具有type
列索引的3,600万行表:
I have created a 36M rows table with an index on type
column:
CREATE TABLE items AS
SELECT
(random()*36000000)::integer AS id,
(random()*10000)::integer AS type,
md5(random()::text) AS s
FROM
generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");
我运行此简单查询,并期望postgresql使用我的索引:
I run this simple query and expect postgresql to use my index:
explain select count(*) from "items" group by "type";
但是查询计划者决定改为使用Seq扫描:
But the query planner decides to use Seq Scan instead:
HashAggregate (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
Group Key: type
-> Seq Scan on items (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms
没有解释的时间: 5s 979ms
Time without EXPLAIN: 5s 979ms
I have tried several solutions from here and here:
- 运行
VACUUM ANALYZE
或VACUUM ANALYZE
- 配置
default_statistics_target
,random_page_cost
,work_mem
- Run
VACUUM ANALYZE
orVACUUM ANALYZE
- Configure
default_statistics_target
,random_page_cost
,work_mem
但设置enable_seqscan = OFF
没什么帮助:
SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";
GroupAggregate (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
Group Key: type
-> Index Only Scan using items_type_idx on items (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms
没有解释的时间: 659ms
在我的计算机上使用索引扫描进行查询的速度大约快10倍.
Query with index scan is about 10x faster on my machine.
有没有比设置enable_seqscan
更好的解决方案?
Is there a better solution than setting enable_seqscan
?
UPD1
我的postgresql版本是9.6.3,work_mem = 4MB(尝试64MB),random_page_cost = 4(尝试1.1),max_parallel_workers_per_gather = 0(尝试4).
My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).
UPD2
我尝试不使用随机数来填充 type 列,而是使用i / 10000
来填充pg_stats.correlation
= 1-仍为seqscan.
I have tried to fill type column not with random numbers, but with i / 10000
to make pg_stats.correlation
= 1 - still seqscan.
UPD3
@jgh是100%正确:
@jgh is 100% right:
通常仅在表的行宽比某些索引宽得多时才会发生
This typically only happens when the table's row width is much wider than some indexes
我制作了大列data
,现在postgres使用了索引.谢谢大家!
I've made large column data
and now postgres use index. Thanks everyone!
推荐答案
仅索引扫描维基说
重要的是要意识到计划者与 最小化查询的总成本.有了数据库,成本 I/O通常占主导地位.因此,"count(*)不带任何 谓词"查询仅在索引为 比它的桌子小得多.这通常仅在以下情况下发生 该表的行宽度比某些索引的行宽度宽得多..
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.
和
仅索引扫描仅在计划者认为 据称,它将减少所需的I/O总量 基于成本的建模不完善.这全都取决于能见度 的元组,如果仍然使用索引(即 谓词,等等),如果实际上有可用的索引, 原则上可以用于仅索引扫描
Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle
因此,您的索引不会被认为明显更小",而是要读取整个数据集,这会导致计划者使用seq扫描
Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner in using a seq scan
这篇关于为什么Postgres不在简单的GROUP BY上使用索引?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!