Choosing the right partitioning rule

Question

I am setting up a new PostgreSQL 9 database that will contain millions (or maybe billions) of rows. So I decided to partition the data using PostgreSQL inheritance.

I created a master table like this (simplified for the example):

CREATE TABLE mytable
(
  user_id integer,
  year integer,
  CONSTRAINT pk_mytable PRIMARY KEY (user_id, year)
);

And 10 partition tables:

CREATE TABLE mytable_0 () INHERITS (mytable);
CREATE TABLE mytable_1 () INHERITS (mytable);
...
CREATE TABLE mytable_9 () INHERITS (mytable);

I know that rows will always be accessed from the application using a unique user_id condition. So I would like to spread the data "quite" equally over the 10 tables using a rule based on user_id.

To tune queries over the master table, my first idea was to use a modulus check constraint:

ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 0);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 1);
...

The problem is that when I query the master table "mytable" with a condition on user_id, the PostgreSQL planner checks all the tables and does not benefit from the check constraint:

EXPLAIN SELECT * FROM mytable WHERE user_id = 12345;

"Result  (cost=0.00..152.69 rows=64 width=36)"
"  ->  Append  (cost=0.00..152.69 rows=64 width=36)"
"        ->  Seq Scan on mytable  (cost=0.00..25.38 rows=6 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_0 mytable  (cost=0.00..1.29 rows=1 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_1 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"
...
"        ->  Seq Scan on mytable_9 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"

Whereas if I use a classic CHECK constraint like this (with the data distributed to match that rule):

ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 1 AND 10000);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 10001 AND 20000);
...

it will scan only the tables that match the condition (mytable and mytable_1 in this example):

"Result  (cost=0.00..152.69 rows=64 width=36)"
"  ->  Append  (cost=0.00..152.69 rows=64 width=36)"
"        ->  Seq Scan on mytable  (cost=0.00..25.38 rows=6 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_1 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"

But such a check constraint is difficult to maintain, because the range of user IDs populated in the tables will change over the years: thousands at first, maybe millions or more in the near future...

What rule could I use to partition my data equally over the 10 tables, one that can benefit from a check constraint so that a SELECT on the master table scans only the right table?

Thanks,

Nico

Recommended answer

The limitation is with the planner rather than the partitioning itself. It's covered in the manual in some detail:

http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
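
As a hedged illustration of that planner limitation: constraint exclusion compares the WHERE clause against each CHECK constraint, and it cannot work out by itself that user_id = 12345 implies user_id % 10 = 5. A commonly used workaround, assuming the modulus constraints from the question are in place on all ten partitions, is to repeat the constraint expression in the query so the planner has something to match. The literal 5 below is an assumption of this sketch; the application would have to compute it.

```sql
-- 'partition' is the default setting from PostgreSQL 8.4 onward;
-- it is set here only to make the sketch explicit.
SET constraint_exclusion = partition;

-- Repeat the CHECK expression with a constant on the right-hand side.
-- 5 is 12345 % 10, computed by the caller (an assumption of this sketch).
EXPLAIN SELECT *
FROM mytable
WHERE user_id = 12345
  AND user_id % 10 = 5;
```

With the extra predicate the plan should list only mytable and mytable_5, but it is worth confirming with EXPLAIN on your own version.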

There are two points worth noting.

Firstly, you say all access will be through the primary key. This means you will gain no performance benefits from partitioning (at least not in normal usage). The index on each partition will be smaller, but PG needs to pick which partition to check first. Where you will gain is if you need to reindex or similar - you can reindex each partition separately.
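
For instance, a maintenance job could rebuild indexes one partition at a time instead of touching the whole data set at once. A minimal sketch, using a partition name from the question and assuming each child table has its own index (child tables do not inherit the master table's primary key):

```sql
-- Rebuild all indexes on a single child table;
-- the other partitions are not locked by this statement.
REINDEX TABLE mytable_3;
```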

Secondly, you say you might have anything from thousands to billions of rows. This leads me to two conclusions:


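  1. Perhaps leave the decision until later. Wait until you need partitions.

  2. You are unlikely to want exactly 10 partitions both with two thousand rows and with two billion.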

If you are going to partition, do it by range - say 100,000 rows or 1 million per partition. Add a cron-job to check the maximum ID used and create a new partition if required (once a day perhaps).
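
A rough sketch of what such a job might run, assuming 1 million ids per partition. The table name mytable_p2 and the range bounds are hypothetical, chosen only for illustration:

```sql
-- Step 1: run periodically (for example from cron via psql)
-- to see how far the ids have grown.
SELECT max(user_id) FROM mytable;

-- Step 2: when the maximum approaches the end of the last range,
-- add the next partition with a non-overlapping range constraint.
CREATE TABLE mytable_p2 (
  CHECK (user_id BETWEEN 2000001 AND 3000000)
) INHERITS (mytable);

-- Child tables do not inherit the master table's primary key,
-- so each partition needs its own.
ALTER TABLE mytable_p2 ADD PRIMARY KEY (user_id, year);
```

Because the constraint is a plain range on user_id, a query such as WHERE user_id = 12345 can be matched against it by constraint exclusion, as in the second plan in the question. Routing inserts to the right child table is a separate concern not covered by this sketch.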

Personally, though, I'd leave it until I needed it. Maybe have a single partition as a catch-all if you think it's more likely than not you'll need them later.
