为什么Postgres不在简单的GROUP BY上使用索引? [英] Why is Postgres not using index on a simple GROUP BY?

查看:788
本文介绍了为什么Postgres不在简单的GROUP BY上使用索引?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我创建了一个具有type列索引的3,600万行表:

I have created a 36M rows table with an index on type column:

CREATE TABLE items AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s
  FROM
    generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");

我运行此简单查询,并期望postgresql使用我的索引:

I run this simple query and expect postgresql to use my index:

explain select count(*) from "items" group by "type";

但是查询计划者决定改为使用Seq扫描:

But the query planner decides to use Seq Scan instead:

HashAggregate  (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
  Group Key: type
  ->  Seq Scan on items  (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms

没有解释的时间: 5s 979ms

Time without EXPLAIN: 5s 979ms

我从此处尝试了几种解决方案和此处:

I have tried several solutions from here and here:

  • 运行VACUUM ANALYZEVACUUM ANALYZE
  • 配置default_statistics_targetrandom_page_costwork_mem
  • Run VACUUM ANALYZE or VACUUM ANALYZE
  • Configure default_statistics_target, random_page_cost, work_mem

但设置enable_seqscan = OFF没什么帮助:

SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";

GroupAggregate  (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
  Group Key: type
  ->  Index Only Scan using items_type_idx on items  (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
        Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms

没有解释的时间: 659ms

在我的计算机上使用索引扫描进行查询的速度大约快10倍.

Query with index scan is about 10x faster on my machine.

有没有比设置enable_seqscan更好的解决方案?

Is there a better solution than setting enable_seqscan?

UPD1

我的postgresql版本是9.6.3,work_mem = 4MB(尝试64MB),random_page_cost = 4(尝试1.1),max_parallel_workers_per_gather = 0(尝试4).

My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).

UPD2

我尝试不使用随机数来填充 type 列,而是使用i / 10000来填充pg_stats.correlation = 1-仍为seqscan.

I have tried to fill type column not with random numbers, but with i / 10000 to make pg_stats.correlation = 1 - still seqscan.

UPD3

@jgh是100%正确:

@jgh is 100% right:

通常仅在表的行宽比某些索引宽得多时才会发生

This typically only happens when the table's row width is much wider than some indexes

我制作了大列data,现在postgres使用了索引.谢谢大家!

I've made large column data and now postgres use index. Thanks everyone!

推荐答案

仅索引扫描维基说

重要的是要意识到计划者与 最小化查询的总成本.有了数据库,成本 I/O通常占主导地位.因此,"count(*)不带任何 谓词"查询仅在索引为 比它的桌子小得多.这通常仅在以下情况下发生 该表的行宽度比某些索引的行宽度宽得多..

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

仅索引扫描仅在计划者认为 据称,它将减少所需的I/O总量 基于成本的建模不完善.这全都取决于能见度 的元组,如果仍然使用索引(即 谓词,等等),如果实际上有可用的索引, 原则上可以用于仅索引扫描

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

因此,您的索引不会被认为明显更小",而是要读取整个数据集,这会导致计划者使用seq扫描

Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner in using a seq scan

这篇关于为什么Postgres不在简单的GROUP BY上使用索引?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆