Postgresql - performance of using array in big database


Question

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index. Every record is around 50-60 bytes.
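
For concreteness, a minimal sketch of such a table (the column names c01..c16, t1, t2 and the index-name pattern are my own; the question only gives the counts):

create table item (
    id_item bigint primary key,
    c01 int, c02 int, c03 int, c04 int,
    c05 int, c06 int, c07 int, c08 int,
    c09 int, c10 int, c11 int, c12 int,
    c13 int, c14 int, c15 int, c16 int,
    t1 text, t2 text
);

-- one index per integer column, as described above
do $$
begin
    for i in 1..16 loop
        execute format('create index item_c%s_idx on item (c%s)',
                       lpad(i::text, 2, '0'), lpad(i::text, 2, '0'));
    end loop;
end $$;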

表名称为项目"
该服务器是:12 GB RAM,1.5 TB SATA,4个CORES. Postgres的所有服务器.
该数据库中还有更多表,因此RAM不能覆盖所有数据库.

The table name is "Item"
The server is: 12 GB RAM, 1,5 TB SATA, 4 CORES. All server for postgres.
There are many more tables in this database so RAM do not cover all database.

我想在表"Item"中添加一列"a_elements"(大整数的数组类型) 在此列中,每条记录将不超过50-60个元素.

I want to add to table "Item" a column "a_elements" (array type of big integers) Every record would have not more than 50-60 elements in this column.

After that I would create a GIN index on this column, and a typical query should look like this:

select * from item where ...... and '{5}' <@ a_elements;
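
For option 1, a minimal sketch of the schema change (the index name is my own; the table and column names follow the question):

alter table item add column a_elements bigint[];

-- GIN index supporting the array containment operators <@ and @>
create index item_a_elements_idx on item using gin (a_elements);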

I also have a second, more classical, option.

Do not add the column a_elements to the table item, but create a table elements with two columns:

  • id_item
  • id_element

This table would have around 200 million records.

I am able to partition these tables, so the number of records would drop to 20 million in the table elements and 500 K in the table item.
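
As an illustration, a sketch of such partitioning using declarative hash partitioning (this syntax requires PostgreSQL 10+; at the time of the question, inheritance with CHECK constraints would have been used instead; the partition key, count, and names are my own and would depend on the real data):

create table elements (
    id_item    bigint,
    id_element bigint
) partition by hash (id_item);

-- 10 partitions of roughly 20 mln rows each, matching the estimate above
create table elements_p0 partition of elements
    for values with (modulus 10, remainder 0);
create table elements_p1 partition of elements
    for values with (modulus 10, remainder 1);
-- ...and so on up to remainder 9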

The query for the second option would look like this:

select item.* 
from item 
    left join elements on (item.id_item=elements.id_item) 
where .... 
and 5 = elements.id_element
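
One note on the query shape: because the WHERE clause filters on elements, the LEFT JOIN degenerates into an inner join, and an item would be returned once per matching elements row. A semi-join sketch (same table and column names as above) avoids duplicates:

select item.*
from item
where ....
  and exists (select 1
                from elements
               where elements.id_item = item.id_item
                 and elements.id_element = 5);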

I wonder which option would be better from a performance point of view. Is Postgres able to use many different indexes together with the GIN index (option 1) in a single query?

I need to make a good decision, because importing this data will take me around 20 days.

Answer

I think you should use an elements table:

  • Postgres would be able to use statistics to predict how many rows will match before executing the query, so it would be able to pick the best query plan (this matters more if your data is not evenly distributed; see the sketch after this list);

  • you would be able to use CLUSTER elements USING elements_id_element_idx;

  • when Postgres 9.2 is released, you would be able to take advantage of index-only scans;
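
A minimal sketch of the first two points (the index name comes from the CLUSTER bullet above and is assumed to exist; element value 5 follows the question):

-- inspect the planner's row estimate for a given element before running the query
explain select * from elements where id_element = 5;

-- physically order the heap by element, then refresh the statistics
cluster elements using elements_id_element_idx;
analyze elements;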

But I've done some tests with 10M elements:

create table elements (id_item bigint, id_element bigint);
insert into elements
  select (random()*524288)::int, (random()*32768)::int
    from generate_series(1,10000000);

\timing
create index elements_id_item on elements(id_item);
Time: 15470.685 ms
create index elements_id_element on elements(id_element);
Time: 15121.090 ms

select relation, pg_size_pretty(pg_relation_size(relation))
  from (
    select unnest(array['elements','elements_id_item', 'elements_id_element'])
      as relation
  ) as _;
      relation       | pg_size_pretty 
---------------------+----------------
 elements            | 422 MB
 elements_id_item    | 214 MB
 elements_id_element | 214 MB



create table arrays (id_item bigint, a_elements bigint[]);
-- one row per item, with all of its elements aggregated into an array
insert into arrays
  select id_item, array_agg(id_element) from elements group by id_item;

create index arrays_a_elements_idx on arrays using gin (a_elements);
Time: 22102.700 ms

select relation, pg_size_pretty(pg_relation_size(relation))
  from (
    select unnest(array['arrays','arrays_a_elements_idx']) as relation
  ) as _;
       relation        | pg_size_pretty 
-----------------------+----------------
 arrays                | 108 MB
 arrays_a_elements_idx | 73 MB

So, on the other hand, arrays are smaller and have a smaller index. I'd do some tests with 200M elements before making a decision.
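
Before committing, the same test harness could time the equivalent lookups side by side (a sketch against the test tables above; @> is the mirror of the <@ operator from the question):

\timing

-- row-per-element layout: items containing element 5, via the btree on id_element
select count(distinct id_item) from elements where id_element = 5;

-- array layout: the same question via GIN containment
select count(*) from arrays where a_elements @> '{5}';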

