Postgres multi-column index (integer, boolean, and array)


Problem description

I have a Postgres 9.4 database with a table like this:

| id | other_id | current | dn_ids                                | rank |
|----|----------|---------|---------------------------------------|------|
| 1  | 5        | F       | {123,234,345,456,111,222,333,444,555} | 1    |
| 2  | 7        | F       | {123,100,200,900,800,700,600,400,323} | 2    |

(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:

CREATE TABLE mytable (
    id integer NOT NULL,
    other_id integer,
    rank integer,
    current boolean DEFAULT false,
    dn_ids integer[] DEFAULT '{}'::integer[]
);

CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;

ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);

CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);

I need to optimize queries like this:

SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000

This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.

Here's an example EXPLAIN:

Limit  (cost=36944.53..36945.78 rows=500 width=65)
  ->  Sort  (cost=36942.03..37007.42 rows=26156 width=65)
        Sort Key: rank
        ->  Seq Scan on mytable  (cost=0.00..35431.42 rows=26156 width=65)
              Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))

Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.

  1. What's the right type of compound index to use for this query? I don't care about space at all.

  2. Does it matter much how I order the terms in the WHERE clause?

Solution

  1. What's the right type of compound index to use for this query? I don't care about space at all.

This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:

You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).

However, that does not cover the boolean data type, which typically doesn't make sense as an index column to begin with. With just two possible values (three including NULL), it's not selective enough.

And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
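To make the multicolumn GIN option concrete, here is a minimal sketch for testing it. It assumes the btree_gin module is installed on the server; the index name is hypothetical. Note that the query in the question negates the overlap test, and a negated condition is not something an index can normally be searched for, which is one more reason to measure before keeping this index.

```sql
-- Assumption: the btree_gin contrib module is available on this server.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical index name. Combines the plain integer column with the
-- integer array column in a single multicolumn GIN index.
CREATE INDEX mytable_other_id_dn_ids_gin
    ON mytable USING gin (other_id, dn_ids);
```

Compare query plans and timings with and without this index, and drop it again if it does not pay off.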

Finally, while indexes can be used for sorted output, this would probably not pay for a query like the one you display (unless you select large parts of the table).
Note: this paragraph is not applicable to the updated question.

Partial indexes might be an option. Other than that, you already have all the indexes you need.

I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
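As a rough sketch of that cleanup: the statistics view pg_stat_user_indexes reports how often each index has been scanned since statistics were last reset, which helps confirm that an index is unused before dropping it. The index name below is the one from the question.

```sql
-- How often has each index on the table been used so far?
SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'mytable'
ORDER  BY idx_scan;

-- Drop the index on the boolean column alone: with so few distinct
-- values it is not selective enough to be useful.
DROP INDEX IF EXISTS index_mytable_on_current;
```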

  2. Does it matter much how I order the terms in the WHERE clause?

The order of WHERE conditions is completely irrelevant.

Addendum after question update

The utility of indexes is bound to selective criteria. If more than roughly 5 % of the table is selected (the exact threshold depends on various factors), a sequential scan of the whole table is typically faster than dealing with the overhead of any index. The exception is pre-sorted output: that's the one thing an index is still good for in such cases.

For a query that fetches 25,000 of 250,000 rows, indexes are mostly useful just for pre-sorted output, which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.

Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
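One way to observe this effect is to run the same query with two different offsets and compare the reported row counts and timings. The following is only a sketch: it uses NOT current as the boolean test and the sample values from the question. Keep in mind that EXPLAIN with the ANALYZE option actually executes the query.

```sql
-- First page: Postgres has to produce 500 rows.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, dn_ids
FROM   mytable
WHERE  other_id = 5
AND    NOT current
AND    NOT (ARRAY[100,200] && dn_ids)
ORDER  BY rank
LIMIT  500 OFFSET 0;

-- Third page: Postgres has to produce 1500 rows and discard the first 1000.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, dn_ids
FROM   mytable
WHERE  other_id = 5
AND    NOT current
AND    NOT (ARRAY[100,200] && dn_ids)
ORDER  BY rank
LIMIT  500 OFFSET 1000;
```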

Even with your added information, much of what's relevant is still in the dark. I am going to assume that:

  1. Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
  2. The most selective predicate is other_id = 5.
  3. A substantial part of the rows is eliminated with NOT current.
    Aside: current = F isn't valid syntax in standard Postgres. It must be NOT current or current = FALSE.
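These assumptions can be checked against the actual data. The sketch below counts how many rows each predicate retains on its own, using the aggregate FILTER clause (available since Postgres 9.4); the literal values are the sample values from the question and the column aliases are made up for illustration.

```sql
SELECT count(*)                                               AS total_rows,
       count(*) FILTER (WHERE other_id = 5)                   AS match_other_id,
       count(*) FILTER (WHERE NOT current)                    AS match_not_current,
       count(*) FILTER (WHERE NOT (ARRAY[100,200] && dn_ids)) AS match_no_overlap
FROM   mytable;
```

The smaller a count is relative to total_rows, the more selective that predicate is, and the more an index on it can help.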

While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:

CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;

The array column dn_ids in a btree index cannot support the && operator. I just include it to allow index-only scans and to filter rows before accessing the heap (the table). The query may even be faster without dn_ids in the index:

CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;

GiST indexes may become more interesting in Postgres 9.5 due to this new feature:

Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

Aside: current is a reserved word in standard SQL, even if it's allowed as an identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate would do nothing.
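For illustration, a serial column creates the sequence, sets the column default, and ties the sequence to the column in one step. The table name mytable_demo below is hypothetical, chosen so the sketch does not clash with the table from the question.

```sql
-- Hypothetical table name; column definitions follow the question.
CREATE TABLE mytable_demo (
    id       serial PRIMARY KEY,            -- sequence + default + NOT NULL
    other_id integer,
    rank     integer,
    current  boolean   DEFAULT false,
    dn_ids   integer[] DEFAULT '{}'::integer[]
);

-- Rows get their id from the sequence automatically.
INSERT INTO mytable_demo (other_id, rank) VALUES (5, 1);
```

For an existing table where the sequence and default were set up by hand, ALTER SEQUENCE mytable_id_seq OWNED BY mytable.id; links the sequence to the column so that it is dropped together with it.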
