PostgreSQL:如何优化我的数据库以存储和查询巨大的图形 [英] PostgreSQL: How to optimize my database for storing and querying a huge graph

查看:17
本文介绍了PostgreSQL:如何优化我的数据库以存储和查询巨大的图形的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在配备 1GB RAM 和 Mac OS X 10.5.8 的 1.83 GHz Intel Core Duo Mac Mini 上运行 PostgreSQL 8.3.我在我的 PostgreSQL 数据库中存储了一个巨大的图表.它由 160 万个节点和 3000 万条边组成.我的数据库架构是这样的:

I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have a stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema is like:

CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256));
CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link));
CREATE INDEX id_idx ON edges (id);
CREATE INDEX link_idx ON edges (link);

表格边缘的数据看起来像

The data in the table edges looks like

id link 
1  234
1  88865
1  6
2  365
2  12
...

因此它为每个具有 id x 的节点存储到 id y 的传出链接.

So it stores for each node with id x the outgoing link to id y.

搜索所有出站链接的时间是可以的:

The time for searching all the outgoing links is ok:

=# explain analyze select link from edges where id=4620;
                           QUERY PLAN                                                        
    ---------------------------------------------------------------------------------
     Index Scan using id_idx on edges  (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
       Index Cond: (id = 4620)
     Total runtime: 158.348 ms
    (3 rows)

但是,如果我搜索一个节点的传入链接,数据库会慢100多倍(虽然结果传入链接的数量只比传出链接的数量高5-10倍):

However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):

=# explain analyze select id from edges where link=4620;
                         QUERY PLAN                                                           
----------------------------------------------------------------------------------
     Bitmap Heap Scan on edges  (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
       Recheck Cond: (link = 4620)
       ->  Bitmap Index Scan on link_idx  (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1)
             Index Cond: (link = 4620)
     Total runtime: 49001.936 ms
    (5 rows)

我试图通过

=# set enable_bitmapscan = false;

但查询传入链接的速度没有提高:

but the speed of the query for incoming links didn't improve:

=# explain analyze select id from edges where link=1588;
                      QUERY PLAN                                                           
-------------------------------------------------------------------------------------------
 Index Scan using link_idx on edges  (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
   Index Cond: (link = 1588)
 Total runtime: 51300.041 ms
(3 rows)

我还将共享缓冲区从 24MB 增加到 512MB,但没有帮助.所以我想知道为什么我对传出和传入链接的查询显示出如此不对称的行为?我选择的索引有问题吗?或者我应该更好地创建第三个表,其中包含 ID 为 x 的节点的所有传入链接?但这会相当浪费磁盘空间.但是由于我是 SQL 数据库的新手,也许我在这里遗漏了一些基本的东西?

I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such an asymmetric behaviour? Is something wrong with my choice of indexes? Or should I better create a third table holding all the incoming links for a node with id x? But that would be quite a waste of disk space. But since I'm new into SQL databases maybe I'm missing something basic here?

推荐答案

我认为 habe 是对的.

您可以通过在边缘上使用 cluster link_idx 来检查这一点;填充表格后分析边缘.现在第二个查询应该很快,第一个应该很慢.

You can check this by using cluster link_idx on edges; analyze edges after filling the table. Now the second query should be fast, and first should be slow.

要快速进行这两个查询,您必须按照您的建议使用第二个表进行非规范化.请记住在加载数据后对第二个表进行聚类和分析,这样所有链接到节点的 egdes 都将在物理上分组.

To have both queries fast you'll have to denormalize by using a second table, as you have proposed. Just remember to cluster and analyze this second table after loading your data, so all egdes linking to a node will be physically grouped.

如果你不会一直查询这个并且你不想存储和备份第二个表,那么你可以在查询之前临时创建它:

If you will not query this all the time and you do not want to store and backup this second table then you can create it temporarily before querying:

create temporary table egdes_backwards
  as select link, id from edges order by link, id;
create index edges_backwards_link_idx on edges_backwards(link);

您不必对这个临时表进行集群,因为它将在创建时物理排序.它对一个查询没有意义,但可以帮助连续多个查询.

You do not have to cluster this temporary table, as it will be physically ordered right on creation. It does not make sense for one query, but can help for several queries in a row.

这篇关于PostgreSQL:如何优化我的数据库以存储和查询巨大的图形的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆