Optimize Postgres timestamp query range


Question

I have the following table and index defined:

CREATE TABLE ticket
(
  wid bigint NOT NULL DEFAULT nextval('tickets_id_seq'::regclass),
  eid bigint,
  created timestamp with time zone NOT NULL DEFAULT now(),
  status integer NOT NULL DEFAULT 0,
  argsxml text,
  moduleid character varying(255),
  source_id bigint,
  file_type_id bigint,
  file_name character varying(255),
  status_reason character varying(255),
  ...
)

I created an index on the created timestamp as follows:

CREATE INDEX ticket_1_idx
  ON ticket
  USING btree
  (created );

This is my query:

select * from ticket 
where created between '2012-12-19 00:00:00' and  '2012-12-20 00:00:00'

This was working fine until the number of records started to grow (to about 5 million), and now it's taking forever to return.

EXPLAIN ANALYZE reveals this:

"Index Scan using ticket_1_idx on ticket  (cost=0.00..10202.64 rows=52543 width=1297) (actual time=0.109..125.704 rows=53340 loops=1)"
"  Index Cond: ((created >= '2012-12-19 00:00:00+00'::timestamp with time zone) AND (created <= '2012-12-20 00:00:00+00'::timestamp with time zone))"
"Total runtime: 175.853 ms"

So far I've tried setting:

random_page_cost = 1.75 
effective_cache_size = 3 

I also ran:

create CLUSTER ticket USING ticket_1_idx;

Nothing works. What am I doing wrong? Why is it selecting a sequential scan? The indexes are supposed to make the query fast. Is there anything that can be done to optimize it?

Recommended answer

CLUSTER

If you intend to use CLUSTER, the syntax shown in the question is invalid:

create CLUSTER ticket USING ticket_1_idx;

Drop the leading create and run it once:

CLUSTER ticket USING ticket_1_idx;

This can help a lot with bigger result sets. Not so much for a single row returned.
Postgres remembers which index to use for subsequent calls. If your table isn't read-only, the effect deteriorates over time and you need to re-run at certain intervals:

CLUSTER ticket;

Possibly only on volatile partitions. See below.

However, if you have lots of updates, CLUSTER (or VACUUM FULL) may actually be bad for performance. The right amount of bloat allows UPDATE to place new row versions on the same data page and avoids the need to physically extend the underlying file in the OS too often. You can use a carefully tuned FILLFACTOR to get the best of both worlds.
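As a rough sketch of the FILLFACTOR idea: the storage parameter below leaves free space in each data page for updated row versions. The value 90 is only an illustrative assumption, not a figure from the answer; a suitable number depends on how update-heavy the table is.

```sql
-- Reserve roughly 10% of every heap page for new row versions
-- (90 is an assumed value; the default for tables is 100).
ALTER TABLE ticket SET (fillfactor = 90);

-- The setting only affects pages written afterwards.
-- Rewriting the table, for example with CLUSTER, applies it to existing data.
CLUSTER ticket USING ticket_1_idx;
```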

CLUSTER takes an exclusive lock on the table, which may be a problem in a multi-user environment. Quoting the manual:

When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

Bold emphasis mine. Consider the alternative pg_repack:

Unlike CLUSTER and VACUUM FULL it works online, without holding an exclusive lock on the processed tables during processing. pg_repack is efficient to boot, with performance comparable to using CLUSTER directly.

And:

pg_repack needs to take an exclusive lock at the end of the reorganization.

Version 1.3.1 works with:

PostgreSQL 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4

Version 1.4.2 works with:

PostgreSQL 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 10

Query

The query is simple enough not to cause any performance problems per se.

However, a word on correctness: the BETWEEN construct includes both bounds. Your query selects all of Dec. 19, plus any records stamped exactly Dec. 20, 00:00. That's an extremely unlikely requirement. Chances are, you really want:

SELECT *
FROM   ticket 
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';
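To make the difference concrete, here is a small self-contained check that does not touch the ticket table. It uses two literal values of type timestamp (an assumption made to keep the result independent of the session time zone; the real column is timestamp with time zone) and compares the inclusive BETWEEN test with the half-open range.

```sql
-- Compare inclusive BETWEEN with a half-open range on two sample values.
SELECT ts,
       ts BETWEEN '2012-12-19' AND '2012-12-20'   AS in_between,
       ts >= '2012-12-19' AND ts < '2012-12-20'   AS in_half_open
FROM  (VALUES (timestamp '2012-12-19 23:59:59'),
              (timestamp '2012-12-20 00:00:00')) AS t(ts);
```

The midnight value on Dec. 20 satisfies BETWEEN but falls outside the half-open range, which is why the second form avoids counting boundary rows twice when you query consecutive days.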

Performance

First off, you ask:

Why is it selecting sequential scan?

Your EXPLAIN output clearly shows an Index Scan, not a sequential table scan. There must be some kind of misunderstanding.
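One way to double-check which plan is actually chosen, and how much data it reads, is to run the query under EXPLAIN with the BUFFERS option (available in PostgreSQL 9.0 and later). The column list below is an assumed subset chosen for illustration; the question does not say which columns are really needed.

```sql
-- A node named "Seq Scan" in the output would indicate a sequential scan;
-- "Index Scan" or "Bitmap Heap Scan" means the index on created is used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT wid, created, status      -- assumed subset of columns
FROM   ticket
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';
```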

If you are pressed hard for better performance, you may be able to improve things. But the necessary background information is not in the question. Possible options include:

  • You could query only the required columns instead of * to reduce transfer cost (and possibly gain other performance benefits).

  • You could look at partitioning and put practical time slices into separate tables. Add indexes to partitions as needed.

  • If partitioning is not an option, another related but less intrusive technique would be to add one or more partial indexes.
    For example, if you mostly query the current month, you could create the following partial index:

CREATE INDEX ticket_created_idx ON ticket(created)
WHERE created >= '2012-12-01 00:00:00'::timestamp;

CREATE a new index right before the start of a new month. You can easily automate the task with a cron job. Optionally DROP partial indexes for old months later.

In addition, keep the full (non-partial) index for CLUSTER, which cannot operate on partial indexes. If old records never change, table partitioning would help this task a lot, since you only need to re-cluster newer partitions. Then again, if records never change at all, you probably don't need CLUSTER.
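The monthly rotation of partial indexes could look roughly like the sketch below. The index name ticket_created_2013_01_idx is hypothetical, and the literal is written as timestamptz with an explicit +00 offset (an assumption based on the offsets shown in the EXPLAIN output) so that it matches the column type regardless of the session time zone.

```sql
-- Run shortly before 2013-01-01, for example from a cron job that calls psql.
-- CONCURRENTLY avoids blocking writes while the index is built,
-- but it cannot be used inside a transaction block.
CREATE INDEX CONCURRENTLY ticket_created_2013_01_idx ON ticket (created)
WHERE created >= '2013-01-01 00:00:00+00'::timestamptz;

-- Optionally, once December is rarely queried, drop the older partial index.
DROP INDEX ticket_created_idx;
```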

If you combine the last two steps, performance should be awesome.
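For the partitioning route, a minimal sketch using declarative partitioning might look as follows. It assumes PostgreSQL 10 or later (older versions would need inheritance-based partitioning instead), and the table and index names are hypothetical. Only a few of the original columns are shown.

```sql
-- Hypothetical partitioned variant of the ticket table.
CREATE TABLE ticket_part (
  wid     bigint NOT NULL,
  created timestamp with time zone NOT NULL DEFAULT now(),
  status  integer NOT NULL DEFAULT 0
) PARTITION BY RANGE (created);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE ticket_part_2012_12 PARTITION OF ticket_part
  FOR VALUES FROM ('2012-12-01 00:00:00+00') TO ('2013-01-01 00:00:00+00');

-- In PostgreSQL 10, indexes are created on each partition separately.
CREATE INDEX ticket_part_2012_12_created_idx
  ON ticket_part_2012_12 (created);

-- Re-cluster only the partition that still receives writes.
CLUSTER ticket_part_2012_12 USING ticket_part_2012_12_created_idx;
```

With this layout, a query for a single day only has to touch the matching monthly partition, and older partitions that no longer change never need to be clustered again.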

You may be missing one of the basics. All the usual performance advice applies.
