PostgreSQL: How to optimize my database for storing and querying a huge graph


Problem description


I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema looks like this:

CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256));
CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link));
CREATE INDEX id_idx ON edges (id);
CREATE INDEX link_idx ON edges (link);

The data in the table edges looks like

id link 
1  234
1  88865
1  6
2  365
2  12
...

So for each node with id x it stores an outgoing link to node id y.

The time for searching all the outgoing links is ok:

=# explain analyze select link from edges where id=4620;
                           QUERY PLAN                                                        
    ---------------------------------------------------------------------------------
     Index Scan using id_idx on edges  (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
       Index Cond: (id = 4620)
     Total runtime: 158.348 ms
    (3 rows)

However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):

=# explain analyze select id from edges where link=4620;
                         QUERY PLAN                                                           
----------------------------------------------------------------------------------
     Bitmap Heap Scan on edges  (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
       Recheck Cond: (link = 4620)
       ->  Bitmap Index Scan on link_idx  (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1)
             Index Cond: (link = 4620)
     Total runtime: 49001.936 ms
    (5 rows)

I tried to force Postgres not to use a Bitmap Scan via

=# set enable_bitmapscan = false;

but the speed of the query for incoming links didn't improve:

=# explain analyze select id from edges where link=1588;
                      QUERY PLAN                                                           
-------------------------------------------------------------------------------------------
 Index Scan using link_idx on edges  (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
   Index Cond: (link = 1588)
 Total runtime: 51300.041 ms
(3 rows)

I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such asymmetric behaviour. Is something wrong with my choice of indexes? Or should I rather create a third table holding all the incoming links for a node with id x? That would be quite a waste of disk space, though. But since I'm new to SQL databases, maybe I'm missing something basic here?
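For reference, the active shared_buffers setting and the on-disk size of the table and its indexes can be checked like this (a minimal sketch; pg_size_pretty, pg_total_relation_size and pg_relation_size are built-in functions in 8.3):

-- Show the shared_buffers value the server is actually running with.
SHOW shared_buffers;

-- Compare it with the total on-disk size of edges (table plus indexes)
-- and the size of the index used by the slow incoming-link query.
SELECT pg_size_pretty(pg_total_relation_size('edges')) AS edges_total_size,
       pg_size_pretty(pg_relation_size('link_idx')) AS link_idx_size;

If those sizes are well above shared_buffers and the machine's 1GB of RAM, fetching thousands of scattered heap rows means mostly random disk I/O, which would explain timings like the ones above.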

Solution

I think habe is right.

You can check this by running cluster link_idx on edges; analyze edges; after filling the table. Now the second query should be fast, and the first one should be slow.
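Spelled out as commands (the cluster index_name on table_name form used above is still accepted in 8.3; note that CLUSTER rewrites the whole table and takes an exclusive lock while it runs):

-- Physically reorder the edges table in link_idx order,
-- then refresh the planner statistics.
CLUSTER link_idx ON edges;
ANALYZE edges;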

To have both queries fast you'll have to denormalize by using a second table, as you have proposed. Just remember to cluster and analyze this second table after loading your data, so all edges linking to a node will be physically grouped.
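A minimal sketch of what that permanent second table could look like (the name incoming_edges and the exact layout are assumptions, not something the answer prescribes):

-- Reverse-lookup table: one row per (link, id) pair copied from edges.
CREATE TABLE incoming_edges (link INTEGER, id INTEGER, PRIMARY KEY (link, id));

INSERT INTO incoming_edges (link, id)
  SELECT link, id FROM edges ORDER BY link, id;

-- Physically group all rows that share the same link value, then refresh statistics.
-- incoming_edges_pkey is the name PostgreSQL gives the primary-key index by default.
CLUSTER incoming_edges_pkey ON incoming_edges;
ANALYZE incoming_edges;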

If you will not be querying this all the time, and you do not want to store and back up this second table, then you can create it temporarily before querying:

create temporary table edges_backwards
  as select link, id from edges order by link, id;
create index edges_backwards_link_idx on edges_backwards(link);

You do not have to cluster this temporary table, as it will be physically ordered right on creation. It does not pay off for a single query, but it can help when you run several queries in a row.
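A reverse lookup against the temporary table would then look like this (a usage sketch, reusing the node id from the question; the lookup can use edges_backwards_link_idx, and the rows for one link value sit next to each other on disk):

-- All incoming links for node 4620.
SELECT id FROM edges_backwards WHERE link = 4620;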
