优化Postgres时间戳查询范围 [英] Optimize Postgres timestamp query range

本文介绍了优化Postgres时间戳查询范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我定义了以下表和索引:

I have the following table and indices defined:

CREATE TABLE ticket
(
  wid bigint NOT NULL DEFAULT nextval('tickets_id_seq'::regclass),
  eid bigint,
  created timestamp with time zone NOT NULL DEFAULT now(),
  status integer NOT NULL DEFAULT 0,
  argsxml text,
  moduleid character varying(255),
  source_id bigint,
  file_type_id bigint,
  file_name character varying(255),
  status_reason character varying(255),
  ...
)

我在<$上创建了一个索引c $ c>已创建时间戳,如下所示:

I created an index on the created timestamp as follows:

CREATE INDEX ticket_1_idx
  ON ticket
  USING btree
  (created );

这里是我的查询

select * from ticket 
where created between '2012-12-19 00:00:00' and  '2012-12-20 00:00:00'

这个工作正常,直到记录数量开始增长(约500万),现在它将永远回归。

This was working fine until the number of records started to grow (about 5 million) and now it's taking forever to return.

解释分析揭示了这一点:

Explain analyze reveals this:

"Index Scan using ticket_1_idx on ticket  (cost=0.00..10202.64 rows=52543 width=1297) (actual time=0.109..125.704 rows=53340 loops=1)"
"  Index Cond: ((created >= '2012-12-19 00:00:00+00'::timestamp with time zone) AND (created <= '2012-12-20 00:00:00+00'::timestamp with time zone))"
"Total runtime: 175.853 ms"

到目前为止,我已尝试设置

So far I've tried setting

random_page_cost = 1.75 
effective_cache_size = 3 

同时创建

create CLUSTER ticket USING ticket_1_idx;

什么都行不通。我究竟做错了什么?为什么选择顺序扫描?索引应该使查询快速。有什么可以做的来优化吗?

Nothing works. What am I doing wrong? Why is it selecting sequential scan? The indexes are supposed to make the query fast. Anything that can be done to optimize it?

推荐答案

CLUSTER



如果您打算使用 CLUSTER ,则显示的语法无效。

CLUSTER

If you intend to use CLUSTER, the displayed syntax is invalid.

使用ticket_1_idx创建CLUSTER票证;

create CLUSTER ticket USING ticket_1_idx;

运行一次:

CLUSTER ticket USING ticket_1_idx;

这个可以对更大的结果集有很大的帮助。返回的单行并不多。

Postgres会记住用于后续调用的索引。如果您的表格不是只读的,则效果会随着时间的推移而恶化,您需要按一定的时间间隔重新运行:

This can help a lot with bigger result sets. Not so much for a single row returned.
Postgres remembers which index to use for subsequent calls. If your table isn't read-only the effect deteriorates over time and you need to re-run at certain intervals:

CLUSTER ticket;

可能只在易失性分区上。请参阅下文。

Possibly only on volatile partitions. See below.

,如果您有大量更新, CLUSTER (或 VACUUM FULL )实际上可能对性能不利。适量的膨胀允许 UPDATE 将新行版本放在同一数据页面上,并且不需要经常在OS中物理扩展底层文件。您可以使用经过精心调整的 FILLFACTOR 来充分利用这两个方面:

However, if you have lots of updates, CLUSTER (or VACUUM FULL) may actually be bad for performance. The right amount of bloat allows UPDATE to place new row versions on the same data page and avoids the need for physically extending the underlying file in the OS too often. You can use a carefully tuned FILLFACTOR to get the best of both worlds:

  • Fill factor for a sequential index that is PK

CLUSTER 对表进行独占锁定,这可能是多用户环境中的问题。 引用手册:

CLUSTER takes an exclusive lock on the table, which may be a problem in a multi-user environment. Quoting the manual:


当一个表被群集时,一个 ACCESS EXCLUSIVE 锁被获得
。这可以防止任何其他数据库操作(读取和
写入
)在表上运行,直到 CLUSTER 完成。

When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

大胆强调我的。考虑替代 pg_repack

Bold emphasis mine. Consider the alternative pg_repack:


CLUSTER 不同VACUUM FULL 它在线工作,在处理过程中没有对处理过的表格持有
独占锁定。 pg_repack的启动效率为
,其性能可与直接使用 CLUSTER 相媲美。

Unlike CLUSTER and VACUUM FULL it works online, without holding an exclusive lock on the processed tables during processing. pg_repack is efficient to boot, with performance comparable to using CLUSTER directly.

和:


pg_repack需要在重组结束时进行独占锁定。

pg_repack needs to take an exclusive lock at the end of the reorganization.

版本1.3.1适用于:

Version 1.3.1 works with:


PostgreSQL 8.3,8.4,9.0 ,9.1,9.2,9.3,9.4

PostgreSQL 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4

版本1.4.2适用于:

Version 1.4.2 works with:


PostgreSQL 9.1,9.2,9.3,9.4,9.5,9.6,10

PostgreSQL 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 10



查询



查询很简单,不会导致任何性能问题。

Query

The query is simple enough not to cause any performance problems per se.

但是,上的一个字正确性 BETWEEN 构建包含边框。您的查询将从12月20日00:00开始选择所有12月19日的 plus 记录。这是极不可能的要求。你真的想要的机会是:

However, a word on correctness: The BETWEEN construct includes borders. Your query selects all of Dec. 19, plus records from Dec. 20, 00:00 hours. That's an extremely unlikely requirement. Chances are, you really want:

SELECT *
FROM   ticket 
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';



表现



首先,你问:

Performance

First off, you ask:


为什么选择顺序扫描?

Why is it selecting sequential scan?

您的 EXPLAIN 输出清楚地显示索引扫描,而不是顺序表扫描。必定存在某种误解。

Your EXPLAIN output clearly shows an Index Scan, not a sequential table scan. There must be some kind of misunderstanding.

如果你努力争取更好的表现,你可能会有所改善。但是必要的背景信息不在问题中。可能的选项包括:

If you are pressed hard for better performance, you may be able to improve things. But the necessary background information is not in the question. Possible options include:


  • 您只能查询所需的列而不是 * 降低转移成本(以及可能的其他性能优势)。

  • You could only query required columns instead of * to reduce transfer cost (and possibly other performance benefits).

您可以查看 分区 并将实际时间片放入单独的表中。根据需要为分区添加索引。

You could look at partitioning and put practical time slices into separate tables. Add indexes to partitions as needed.

如果分区不是一个选项,另一个相关但较少侵入性的技术是添加一个或多个 部分索引

例如,如果您主要查询当前月份,则可以创建以下部分索引:

If partitioning is not an option, another related but less intrusive technique would be to add one or more partial indexes.
For example, if you mostly query the current month, you could create the following partial index:

CREATE INDEX ticket_created_idx ON ticket(created)
WHERE created >= '2012-12-01 00:00:00'::timestamp;

CREATE 新索引新月开始之前。您可以使用cron作业轻松自动执行任务。
(可选) DROP 过去几个月的部分索引。

CREATE a new index right before the start of a new month. You can easily automate the task with a cron job. Optionally DROP partial indexes for old months later.

另外保留总索引for CLUSTER (不能对部分索引进行操作)。如果旧记录永远不会更改,表分区将对此任务有很大帮助,因为您只需要重新集群较新的分区。
然后如果记录永远不会改变,你可能不需要 CLUSTER

Keep the total index in addition for CLUSTER (which cannot operate on partial indexes). If old records never change, table partitioning would help this task a lot, since you only need to re-cluster newer partitions. Then again if records never change at all, you probably don't need CLUSTER.

如果你结合最后两个步骤,性能应该很棒。

If you combine the last two steps, performance should be awesome.

您可能遗漏了其中一个基础知识。所有通常的性能建议都适用:

You may be missing one of the basics. All the usual performance advice applies:

  • https://wiki.postgresql.org/wiki/Slow_Query_Questions
  • https://wiki.postgresql.org/wiki/Performance_Optimization

这篇关于优化Postgres时间戳查询范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆