Table partitioning for maximum speed?


Problem description

I'm sure this is a concept that's been explored here. I have a table
(fairly simple, just two columns, one of which is a 32-digit checksum)
with several million rows (currently, about 7 million). About a million
times a day we do

select * from my_table where md5 = ?

to verify presence or absence of the row, and base further processing on
that information.

The idea bandied about now is to partition this table into 16 (or 256,
or ...) chunks by first digit (or 2, or ...). In the simplest case, this
would mean:

create table my_table_0 as select * from my_table where md5 like '0%';

create table my_table_1 as select * from my_table where md5 like '1%';

....

create table my_table_f as select * from my_table where md5 like 'f%';

Then change the code to examine the checksum and create a query to the
appropriate table based on the first digit.
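
The routing step described above could look roughly like the following on the application side. This is only a sketch: the helper names `partition_table` and `lookup_query` are hypothetical, the `my_table_<prefix>` naming follows the `create table` statements above, and the `?` placeholder style is assumed to match whatever database driver is in use.

```python
import string

# Lowercase hex digits, used to validate the checksum before it is
# spliced into a table name.
HEX_DIGITS = set(string.hexdigits.lower())


def partition_table(md5_hex, prefix_len=1, base="my_table"):
    """Return the name of the partition table that would hold this checksum.

    prefix_len=1 gives 16 tables (my_table_0 .. my_table_f),
    prefix_len=2 gives 256, and so on.
    """
    digest = md5_hex.strip().lower()
    if len(digest) != 32 or not set(digest) <= HEX_DIGITS:
        raise ValueError("not a 32-digit hex checksum: %r" % (md5_hex,))
    return "%s_%s" % (base, digest[:prefix_len])


def lookup_query(md5_hex, prefix_len=1):
    """Build the (sql, params) pair for the existence check."""
    table = partition_table(md5_hex, prefix_len)
    # The table name is safe to interpolate because it was validated above;
    # the checksum itself is still passed as a bound parameter.
    sql = "select * from %s where md5 = ?" % table
    return sql, (md5_hex.strip().lower(),)
```

For example, a checksum starting with `d4` would be routed to `my_table_d` with 16 partitions and to `my_table_d4` with 256.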

Obviously, this is conceptually similar to what the index on the "md5"
column is supposed to do for us. However, partitioning moves just a
little of the processing load off the database server and onto the
machine running the application. That's important, because we can afford
more application machines as load increases, but we can't as easily
upgrade the database server.

Will a query against a table of 0.5 million rows beat a query against a
table of 7 million rows by a margin that makes it worth the hassle of
supporting 15 "extra" tables?

--
Jeff Boes vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
...Nexcerpt... Extend your Expertise

Recommended answer

On Thu, Oct 09, 2003 at 18:37:19 +0000,
Jeff Boes <jb***@nexcerpt.com> wrote:
> I'm sure this is a concept that's been explored here. I have a table
> (fairly simple, just two columns, one of which is a 32-digit checksum)
> with several million rows (currently, about 7 million). About a million
> times a day we do
>
> select * from my_table where md5 = ?
>
> to verify presence or absence of the row, and base further processing on
> that information.
>
> The idea bandied about now is to partition this table into 16 (or 256,
> or ...) chunks by first digit (or 2, or ...). In the simplest case, this
> would mean:





If there is an index on the checksum column, then you shouldn't get
much of a speed up by partitioning the data.
If you don't have an index on the checksum, it sounds like you should.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org





-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> That's important, because we can afford more application
> machines as load increases, but we can't as easily
> upgrade the database server.





Two ideas come to mind:

One way to speed things up is to convert the entire checksum. Consider
what a md5 checksum really is: a text string representing a hexadecimal
number. Storing it as TEXT or CHAR is not as good as storing it as a
number directly. Have your application convert it to a decimal number,
and then store the checksum as type NUMERIC in the database. This gives
an immediate speed boost. Next, use partial indexes to speed things up
even further. How many partial indexes you want to create depends on your
ratio of selects to updates, and how important each is to you. Some quick
statistical analysis I did showed that for 10 indexes, the magic number
is somewhere around 3.402 x 10 ^ 37. In other words:

CREATE TABLE md5check (id SERIAL, md5 NUMERIC);

CREATE INDEX md5_i0 ON md5check (md5) WHERE
md5 <= 34000000000000000000000000000000000000;
CREATE INDEX md5_i1 ON md5check (md5) WHERE
md5 > 34000000000000000000000000000000000000 AND
md5 <= 68000000000000000000000000000000000000;
CREATE INDEX md5_i2 ON md5check (md5) WHERE
md5 > 68000000000000000000000000000000000000 AND
md5 <= 102000000000000000000000000000000000000;
....
CREATE INDEX md5_i10 ON md5check (md5) WHERE
md5 > 340000000000000000000000000000000000000;

On my test table with 1/2 million rows, I saw a speed up from
.16 msec (using TEXT only) to .09 msec. The more partial indexes
you create, the faster things will go. Just remember to put the
upper and lower boundary indexes in place to catch everything.
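
As a rough illustration of the application-side conversion, here is a sketch in Python. The function names are hypothetical. It assumes the checksum arrives as a 32-digit hex string and that the ten ranges are equal-width slices of the 128-bit value space, which is where a figure of about 3.402 x 10^37 comes from (2^128 / 10).

```python
MD5_SPACE = 2 ** 128  # number of distinct 128-bit checksum values


def md5_to_decimal(md5_hex):
    """Convert a 32-digit hex checksum to the integer stored as NUMERIC."""
    return int(md5_hex, 16)


def bucket_width(n_indexes=10):
    """Width of each equal slice of the checksum space."""
    return MD5_SPACE // n_indexes


def bucket_for(md5_hex, n_indexes=10):
    """Index (0-based) of the equal-width range a checksum falls into."""
    bucket = md5_to_decimal(md5_hex) // bucket_width(n_indexes)
    # Integer division leaves a few values above the last full slice;
    # fold them into the final bucket.
    return min(bucket, n_indexes - 1)
```

With ten ranges, `bucket_width()` is a 38-digit number beginning 3402..., in line with the boundary quoted above. The `CREATE INDEX` statements round it to 3.4 x 10^37 and let the last index catch the remainder.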

Aside: if you are merely testing for the existence of the row,
you can pull back a constant instead of the whole row:

SELECT 1 FROM md5check WHERE md5 = ?

Another way to speed things up is to break the checksum up into parts
so that we can use one of the "normal" datatypes: specifically, BIGINT.
Divide the 32 character checksum into four pieces, convert each piece
to a decimal number, and store each in its own BIGINT column. The good
news with this way is that you only need an index on one of the columns.
Even at 7 million plus, the number of matches of 1/4 of the checksum
characters is small enough to not need additional indexes.

CREATE TABLE md5check
(id SERIAL, md1 BIGINT, md2 BIGINT, md3 BIGINT, md4 BIGINT);

CREATE INDEX md5_i1 ON md5check(md1);

You can also add partial indexes to this as well, for maximum speed.
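
A sketch of the four-way split in Python, assuming the checksum is a 32-digit hex string and the pieces map to md1 through md4 in order. The function name `split_md5` is hypothetical. Each 8-digit piece is at most 2^32 - 1, which fits a BIGINT but can overflow a signed 32-bit INTEGER.

```python
def split_md5(md5_hex):
    """Split a 32-digit hex checksum into four integers (md1, md2, md3, md4)."""
    digest = md5_hex.strip()
    if len(digest) != 32:
        raise ValueError("expected 32 hex digits, got %d" % len(digest))
    # Each 8-digit slice is a 32-bit value in the range 0 .. 4294967295.
    return tuple(int(digest[i:i + 8], 16) for i in range(0, 32, 8))


def join_md5(parts):
    """Inverse of split_md5: rebuild the 32-digit hex string."""
    return "".join("%08x" % part for part in parts)
```

A lookup would then compare all four columns, with the index on md1 doing the narrowing.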
- --
Greg Sabino Mullane gr**@turnstep.com
PGP Key: 0x14964AC8 200310101135

-----BEGIN PGP SIGNATURE-----
Comment: http://www.turnstep.com/pgp.html

iD8DBQE/ht6DvJuQZxSWSsgRAjvyAJ9ndadWAgJIm84dc/kB8RABEIzIbwCg1UJL
2VUQeQU+LMgXnumOoMT6kWk=
=PeUQ
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend


Please keep discussions on the list so that others may learn from or comment
on the suggested solutions.

On Fri, Oct 10, 2003 at 11:27:50 -0400,
Jeff Boes <jb***@nexcerpt.com> wrote:
> Bruno Wolff III wrote:
>
> > On Thu, Oct 09, 2003 at 18:37:19 +0000,
> > Jeff Boes <jb***@nexcerpt.com> wrote:
> >
> > > The idea bandied about now is to partition this table into 16 (or 256,
> > > or ...) chunks by first digit (or 2, or ...). In the simplest case, this
> > > would mean:
> >
> > If there is an index on the checksum column, then you shouldn't get
> > much of a speed up by partitioning the data.
> > If you don't have an index on the checksum, it sounds like you should.


> Yes, the table has:
>
>       Table "public.link_checksums"
>  Column  |     Type      | Modifiers
> ---------+---------------+-----------
>  md5     | character(32) | not null
>  link_id | integer       | not null
> Indexes: ix_link_checksums_pk primary key btree (md5)






In that event I would expect that you might only save a few disk accesses
by having a btree with fewer levels.
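
To put a rough number on that, here is a back-of-the-envelope sketch in Python. The fanout of 100 entries per index page is purely an assumption for illustration; the real figure depends on page size and key width. The function name `btree_levels` is hypothetical.

```python
def btree_levels(n_rows, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers n_rows."""
    levels, capacity = 1, fanout
    while capacity < n_rows:
        capacity *= fanout
        levels += 1
    return levels


ASSUMED_FANOUT = 100            # assumption, not a measured value
full_table = 7_000_000          # row count from the original question
one_partition = full_table // 16

levels_full = btree_levels(full_table, ASSUMED_FANOUT)
levels_part = btree_levels(one_partition, ASSUMED_FANOUT)
```

Under that assumption the full table needs 4 levels and a 1/16 partition needs 3, so each lookup would save about one page access.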

If the query is slow, it might be doing a sequential search because of
a type mismatch. You can use explain to double check what plan is being
used.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

