选择主键:为什么postgres更喜欢顺序扫描还是索引扫描 [英] Selecting primary key:Why postgres prefers to do sequential scan vs index scan

查看:114
本文介绍了选择主键:为什么postgres更喜欢顺序扫描还是索引扫描的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有下表

create table log
(
    id bigint default nextval('log_id_seq'::regclass) not null
        constraint log_pkey
            primary key,
    level integer,
    category varchar(255),
    log_time timestamp,
    prefix text,
    message text
);

它包含大约300万行。

It contains like 3 million of rows.

我正在比较以下查询:

EXPLAIN SELECT id
        FROM log
        WHERE log_time < now() - INTERVAL '3 month'
        LIMIT 100000

,它产生以下计划:

Limit  (cost=0.00..19498.87 rows=100000 width=8)
  ->  Seq Scan on log  (cost=0.00..422740.48 rows=2168025 width=8)
        Filter: (log_time < (now() - '3 mons'::interval))

并添加了带有ORDER BY id指令的相同查询:

And the same query with ORDER BY id instruction added:

EXPLAIN SELECT id
        FROM log
        WHERE log_time < now() - INTERVAL '3 month'
        ORDER BY id ASC
        LIMIT 100000



which yields

Limit  (cost=0.43..25694.15 rows=100000 width=8)
  ->  Index Scan using log_pkey on log  (cost=0.43..557048.28 rows=2168031 width=8)
        Filter: (log_time < (now() - '3 mons'::interval))

我有以下问题:


  • 缺少ORDER BY指令使Postgres不必关心行的顺序。它们也可以按分类交付。为什么在没有ORDER BY的情况下不使用索引?

  • The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?


  • Postgres如何在这样的查询中首先使用索引?查询的 WHERE 子句包含未索引的列,并且要获取该列,将需要顺序数据库扫描,但是使用 ORDER BY <的查询/ code>并不表示。

  • How can Postgres use index in the first place in such a query? WHERE clause of the query contains a non-indexed column and to fetch that column, sequential database scan will be required, but the query with ORDER BY doesn't indicate that.

Postgres手册页上说:

The Postgres manual page says:


对于需要扫描表大部分的查询,显式排序可能比使用索引更快,因为它需要较少的磁盘I / O遵循顺序访问模式

For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern


您能为我澄清一下吗?索引始终是有序的。而且读取有序结构总是更快,与读取无序数据然后手动排序相比,它总是一个顺序访问(至少在页面扫描方面)。

Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.

推荐答案


能否请您为我澄清此声明?索引始终是有序的。而且读取有序结构总是更快,与读取无序数据然后对其进行手动排序相比,它始终是顺序访问(至少在页面扫描方面)。

Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.

按顺序读取索引,是的,但是postgres需要跟进读取表中的行。也就是说,在大多数情况下,如果索引标识100行,则postgres将需要对该表执行最多100次随机读取。

The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.

内部,postgres计划者会权衡顺序读取和随机读取的方式有所不同,随机读取通常要贵得多。设置 seq_page_cost random_page_cost 确定这些设置。如果您有其他设置,您可以查看和修改您想,尽管我建议您对修改做些保守。

Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.

让我们回到您之前的问题:

Let's go back to your earlier questions:


缺少ORDER BY指令使Postgres不必关心行的顺序。它们也可以按分类交付。为什么在没有ORDER BY的情况下不使用索引?

The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?

原因是排序。如您稍后所述,索引不包含约束列,因此使用索引没有任何意义。相反,计划人员基本上是在说读取整个表,找出符合约束的行,然后以找到的顺序返回前100000行。

The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".

排序改变了事情。在这种情况下,计划者说我们需要按此字段排序,并且我们已经有一个索引已排序,因此请按索引顺序从表中读取行,对照约束进行检查,直到有100000条记录为止,并且

The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".

您会注意到,第二个查询的费用估算值(例如'0.43..25694.15')要高得多-计划者认为从索引扫描中进行如此多的随机读取将比不进行任何排序而不是一次读取整个表的成本高得多。

You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.

希望有帮助,让我知道如果您还有其他问题。

Hope that helps, and let me know if you have further questions.

这篇关于选择主键:为什么postgres更喜欢顺序扫描还是索引扫描的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆