Best way to exclude outdated data from a search in PostgreSQL


Question


I have a table containing the following columns:

  • an integer column named id
  • a text column named value
  • a timestamp column named creation_date

Currently, indexes have been created for the id and value columns.

I must search this table for a given value and want to make search as fast as I can. But I don't really need to look through records that are older than one month. So, ideally I would like to exclude them from the index.

What would be the best way to achieve this:

  1. Perform table partitioning. Only search through the subtable for the appropriate month.
  2. Create a partial index including only the recent records. Recreate it every month.
  3. Something else?

(PS.: "the best solution" means the solution that is the most convenient, fast and easy to maintain)

Solution

Partial index

A partial index would be perfect for that, or even a partial multicolumn index. But your condition

"don't need to search value in records older than one month"

is not stable. The condition of a partial index can only use literals or IMMUTABLE functions, i.e., constant values. You mention "Recreate it every month", but that does not agree with your definition "older than one month". You see the difference, right?
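To see the restriction in practice, here is a minimal sketch against a throwaway temp table (demo_tbl and both index names are made up for illustration, they are not part of the question). The first attempt uses now(), which is only STABLE, so PostgreSQL should reject it with a complaint about functions in the index predicate; the second uses a constant and is accepted.

```sql
-- Throwaway table with the same relevant columns as the question's table
-- (demo_tbl is a hypothetical name for illustration).
CREATE TEMP TABLE demo_tbl (
   value text
  ,creation_date timestamp
);

-- A moving cutoff is rejected: now() is STABLE, not IMMUTABLE.
-- The DO block only catches the error so the script keeps running.
DO
$do$
BEGIN
   EXECUTE $q$CREATE INDEX demo_moving_idx ON demo_tbl (value, creation_date)
              WHERE creation_date >= now() - interval '1 month'$q$;
EXCEPTION WHEN OTHERS THEN
   RAISE NOTICE 'index rejected: %', SQLERRM;
END
$do$;

-- A constant cutoff is accepted.
CREATE INDEX demo_fixed_idx ON demo_tbl (value, creation_date)
WHERE creation_date >= timestamp '2013-01-01 00:00';
```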

If you only need the current (or last) calendar month, index recreation as well as the query itself become quite a bit simpler!
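As a rough sketch of that simpler calendar-month variant, assuming the mytbl test table used in this answer: the index name and the month boundaries are example values, and the index would have to be recreated with new bounds each month.

```sql
-- Partial index covering exactly one calendar month (example bounds).
CREATE INDEX mytbl_2013_03_idx ON mytbl (value, creation_date)
WHERE creation_date >= timestamp '2013-03-01'
AND   creation_date <  timestamp '2013-04-01';

-- Repeat the same constant bounds in the query so the planner
-- can match them against the index predicate.
SELECT value
FROM   mytbl
WHERE  creation_date >= timestamp '2013-03-01'
AND    creation_date <  timestamp '2013-04-01'
AND    value = 'foo';
```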

I'll go with your definition "not older than one month" for the rest of this answer. I have had to deal with situations like this before. The following solution worked best for me:

Base your index condition on a fixed timestamp and use the same timestamp in your queries to convince the query planner it can use the partial index. This kind of partial index stays useful over an extended period of time; only its effectiveness deteriorates as new rows are added and older rows drop out of your time frame. The index returns more and more false positives, which an additional WHERE clause in your query has to eliminate. Recreate the index to update its condition.

Given your test table:

CREATE TABLE mytbl (
   value text
  ,creation_date timestamp
);

Create a very simple IMMUTABLE SQL function:

CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-01-01 0:0'::timestamp
$func$ LANGUAGE sql IMMUTABLE;

Use the function in the condition of the partial index:

CREATE INDEX mytbl_start_ts_idx ON mytbl(value, creation_date)
WHERE (creation_date >= f_mytbl_start_ts());

value comes first. Explanation in this related answer on dba.SE.
Input from @Igor in the comments made me improve my answer. A partial multicolumn index should make ruling out false positives from the partial index faster - it's in the nature of the index condition that it's always increasingly outdated (but still a lot better than not having it).

Query

A query like this will make use of the index and should be perfectly fast:

SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()            -- !
AND    creation_date >= (now() - interval '1 month')
AND    value = 'foo';

The only purpose of the seemingly redundant WHERE clause: creation_date >= f_mytbl_start_ts() is to make the query planner use the partial index.
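One way to check this on your own data is to compare the plans with and without the extra condition. This is only a sketch and assumes mytbl, f_mytbl_start_ts() and mytbl_start_ts_idx from above exist; on a very small table the planner may prefer a sequential scan either way.

```sql
-- With the redundant condition: the planner can match the index predicate.
EXPLAIN
SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()
AND    creation_date >= (now() - interval '1 month')
AND    value = 'foo';

-- Without it: now() is not a constant at planning time, so the planner
-- cannot prove that the partial index covers the requested rows.
EXPLAIN
SELECT value
FROM   mytbl
WHERE  creation_date >= (now() - interval '1 month')
AND    value = 'foo';
```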

You can drop and recreate function and index manually.
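A manual refresh could look roughly like this; the new cutoff is an example value. The index is dropped before the function is replaced, because an index built with the old cutoff would no longer match what the function returns.

```sql
BEGIN;

-- Drop the index first: it was built with the old cutoff.
DROP INDEX IF EXISTS mytbl_start_ts_idx;

-- Move the cutoff (example value).
CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-02-01 0:0'::timestamp
$func$ LANGUAGE sql IMMUTABLE;

-- Rebuild the partial index against the new cutoff.
CREATE INDEX mytbl_start_ts_idx ON mytbl(value, creation_date)
WHERE (creation_date >= f_mytbl_start_ts());

COMMIT;
```

Note that this holds locks on mytbl until COMMIT, which can hurt on a busy table. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so a low-lock variant would need a different sequence of steps.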

Full automation

Or you can automate it in a bigger scheme with possibly lots of similar tables:

Disclaimer: This is advanced stuff. You need to know what you are doing and consider user privileges, possible SQL injection and locking issues with heavy concurrent load!

This "steering table" holds one row per table in your regime:

CREATE TABLE idx_control (
   tbl text primary key  -- plain, legal table names!
  ,start_ts timestamp
);

I would put all such meta objects in a separate schema.

For our example:

INSERT INTO idx_control(tbl, start_ts)
VALUES ('mytbl', '2013-1-1 0:0');

A "steering table" offers the additional benefit that you have an overview over all such tables and their respective settings in a central place and you can update some or all of them in sync.

Whenever you change start_ts in this table the following trigger kicks in and takes care of the rest:

Trigger function:

CREATE OR REPLACE FUNCTION trg_idx_control_upaft()
  RETURNS trigger AS
$func$
DECLARE
   _idx  text := NEW.tbl || '_start_ts_idx';
   _func text := 'f_' || NEW.tbl || '_start_ts';
BEGIN

-- Drop old idx
EXECUTE format('DROP INDEX IF EXISTS %I', _idx);

-- Create / change function; Keep placeholder with -infinity for NULL timestamp
EXECUTE format('
CREATE OR REPLACE FUNCTION %I()
  RETURNS timestamp AS
$x$
SELECT %L::timestamp
$x$ LANGUAGE SQL IMMUTABLE', _func, COALESCE(NEW.start_ts, '-infinity'));

-- New Index; NULL timestamp removes idx condition:    
IF NEW.start_ts IS NULL THEN 
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)', _idx, NEW.tbl);
ELSE
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)
   WHERE  creation_date >= %I()', _idx, NEW.tbl, _func);
END IF;

RETURN NULL;

END
$func$ LANGUAGE plpgsql;
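The function builds all dynamic SQL with format(), which is what guards against SQL injection through the tbl column. A small sketch to see the quoting at work; 'My Tbl' is a deliberately awkward, hypothetical table name:

```sql
-- %I quotes a value as an identifier (only when needed),
-- %L always quotes it as a string literal.
SELECT format('CREATE INDEX %I ON %I (value, creation_date)'
            , 'My Tbl_start_ts_idx', 'My Tbl')           AS index_ddl
     , format('SELECT %L::timestamp', '2013-01-01 0:0')  AS func_body;
```

Both identifiers should come back double-quoted in index_ddl, so a name with spaces or mixed case cannot break out of the statement.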

Trigger:

CREATE TRIGGER upaft
AFTER UPDATE ON idx_control
FOR EACH ROW
WHEN (OLD.start_ts IS DISTINCT FROM NEW.start_ts)
EXECUTE PROCEDURE trg_idx_control_upaft();

Now, a simple UPDATE on the steering table calibrates index and function:

UPDATE idx_control
SET    start_ts = '2013-03-22 0:0'
WHERE  tbl = 'mytbl';
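To confirm that the trigger did its job, you could inspect the function result and the index definition afterwards. A sketch, assuming the objects from this answer exist:

```sql
-- Current cutoff as seen by queries:
SELECT f_mytbl_start_ts();

-- Definition of the index(es) on the table, including the WHERE condition:
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'mytbl';

-- Central overview of all managed tables:
SELECT tbl, start_ts
FROM   idx_control
ORDER  BY tbl;
```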

You can run a cron job or call this manually.
Queries using the index don't change.
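For a scheduled job, the statement could compute the cutoff instead of hard-coding it. This is a sketch; the policy "midnight, one month ago" is just an example:

```sql
-- Move every managed cutoff to midnight one month ago.
-- The trigger's WHEN clause skips rows whose start_ts does not change.
UPDATE idx_control
SET    start_ts = date_trunc('day', localtimestamp - interval '1 month');
```

A cron entry could pass this statement to psql with -c; database name and schedule are up to you. Keep in mind that every changed row rebuilds an index inside the same transaction, so with many tables it may be better to update them one at a time.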

-> SQLfiddle.
I updated the fiddle with a small test case of 10k rows to demonstrate it works. PostgreSQL will even do an index-only scan for my example query. Won't get any faster than this.
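If the fiddle is not available, a comparable test can be set up locally. The following is my own sketch, not the contents of the fiddle; it assumes mytbl, f_mytbl_start_ts() and the partial index from above, and the generated values are arbitrary.

```sql
-- 10k rows spread over roughly the last 400 days, 100 distinct values.
INSERT INTO mytbl (value, creation_date)
SELECT 'val_' || (g % 100)
     , localtimestamp - (g % 400) * interval '1 day'
FROM   generate_series(1, 10000) g;

-- Refresh statistics and the visibility map;
-- index-only scans depend on the latter.
VACUUM ANALYZE mytbl;

EXPLAIN (ANALYZE)
SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()
AND    creation_date >= (now() - interval '1 month')
AND    value = 'val_42';
```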
