PostgreSQL 如何使用字段上的 b 树索引执行 ORDER BY? [英] How does PostgreSQL perform ORDER BY with a b-tree index on the field?

查看:17
本文介绍了PostgreSQL 如何使用字段上的 b 树索引执行 ORDER BY?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一张表 bsort:

CREATE TABLE bsort(a int, data text);

此处data 可能不完整.换句话说,有些元组可能没有 data 值.

Here data may be incomplete. In other words, some tuples may not have data value.

然后我在表上建立一个b-tree索引:

And then I build a b-tree index on the table:

CREATE INDEX ON bsort USING BTREE(a);

现在,如果我执行此查询:

Now if I perform this query:

SELECT * FROM bsort ORDER BY a;

PostgreSQL 是对复杂度为 nlogn 的元组进行排序,还是直接从 b-tree 索引中获取顺序?

Does PostgreSQL sort tuples with nlogn complexity, or does it get the order directly from the b-tree index?

推荐答案

对于像这样的简单查询,Postgres 将使用 索引扫描并按顺序从索引中检索容易排序的元组.由于其 MVCC 模型,Postgres 必须始终访问堆";(数据页)另外用于验证条目是否对当前事务实际可见.引用 Postgres Wiki 上的仅索引扫描:

For a simple query like this Postgres will use an index scan and retrieve readily sorted tuples from the index in order. Due to its MVCC model Postgres had to always visit the "heap" (data pages) additionally to verify entries are actually visible to the current transaction. Quoting the Postgres Wiki on index-only scans:

PostgreSQL 索引不包含可见性信息.这就对了不能直接确定任何给定的元组是否可见当前交易,这就是为什么它需要这么长时间才能只使用索引要实施的扫描.

PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented.

最终发生在 9.2 版本中:仅索引扫描.手册:

Which finally happened in version 9.2: index-only scans. The manual:

如果索引存储原始索引数据值(而不是一些它们的有损表示),支持 很有用index-only scans,其中索引返回实际数据而不仅仅是TID堆元组.如果可见性图显示,这只会避免 I/OTID 在一个全可见的页面上;否则堆元组必须是无论如何访问以检查MVCC可见性.但这并不关心访问方法.

If the index stores the original indexed data values (and not some lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data not just the TID of the heap tuple. This will only avoid I/O if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check MVCC visibility. But that is no concern of the access method's.

可见性地图决定是否可以进行仅索引扫描.如果所有涉及的列值都包含在索引中,则仅是一个选项.否则,无论如何都必须(另外)访问堆.仍然不需要排序步骤.

这就是为什么我们现在有时会将无用的列附加到索引中.就像示例中的 data 列:

That's why we sometimes append otherwise useless columns to indexes now. Like the data column in your example:

CREATE INDEX ON bsort (a, data);  -- btree is the default index type

它使索引更大(取决于)并且维护和用于其他目的的成本更高.因此,如果您从中获得仅索引扫描,则仅附加 data 列.索引中列的顺序很重要:

It makes the index bigger (depends) and a bit more expensive to maintain and use for other purposes. So only append the data column if you get index-only scans out of it. The order of columns in the index is important:

从 Postgres 11 开始,还有覆盖索引"使用 INCLUDE 关键字.喜欢:

Since Postgres 11, there are also "covering indexes" with the INCLUDE keyword. Like:

CREATE INDEX ON bsort (a) INCLUDE (data);

见:

仅索引扫描的好处,每个文档:

如果知道页面上的所有元组都是可见的,堆取可以跳过.这在大型数据集上最为明显,其中可见性映射可以防止磁盘访问.能见度图是巨大的比堆小,因此即使在堆中也可以轻松缓存很大.

If it's known that all tuples on the page are visible, the heap fetch can be skipped. This is most noticeable on large data sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the heap, so it can easily be cached even when the heap is very large.

可见性地图由 VACUUM 如果您有 autovacuum 正在运行(现代 Postgres 中的默认设置).详情:

The visibility map is maintained by VACUUM which happens automatically if you have autovacuum running (the default setting in modern Postgres). Details:

但是在对表的写入操作和下一次 VACUUM 运行之间存在一些延迟.要点:

But there is some delay between write operations to the table and the next VACUUM run. The gist of it:

  • 只读表在清空后随时可以进行仅索引扫描.
  • 被修改的数据页丢失了它们的全部可见"可见性映射中的标志,直到下一个 VACUUM(以及所有较旧的交易完成),因此它取决于写入操作和 VACUUM 频率之间的比率.
  • Read-only tables stay ready for index-only scans once vacuumed.
  • Data pages that have been modified lose their "all-visible" flag in the visibility map until the next VACUUM (and all older tansactions being finished), so it depends on the ratio between write operations and VACUUM frequency.

如果某些涉及的页面被标记为全部可见,则部分仅索引扫描仍然是可能的.但是,如果无论如何都必须访问堆,则访问方法索引扫描"将无效.便宜一点.因此,如果当前有太多页面脏了,Postgres 将完全切换到更便宜的索引扫描.再次访问 Postgres Wiki:

Partial index-only scans are still possible if some of the involved pages are marked all-visible. But if the heap has to be visited anyway, the access method "index scan" is a bit cheaper. So if too many pages are currently dirty, Postgres will switch to the cheaper index scan altogether. The Postgres Wiki again:

作为预计的堆获取(或访问")的数量计划者需要上升,计划者最终会得出结论仅索引扫描是不可取的,因为它不是最便宜的根据其成本模型可能的计划.index-only 的值扫描完全在于它们允许我们省略堆访问的潜力(如果只是部分)并最小化 I/O.

As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.

这篇关于PostgreSQL 如何使用字段上的 b 树索引执行 ORDER BY?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆