Search in 300 million addresses with pg_trgm


Question


I have 300 million addresses in my PostgreSQL 9.3 DB and I want to use pg_trgm to fuzzy search the rows. The final purpose is to implement a search function just like Google Map search.


When I use pg_trgm to search these addresses, it takes about 30 s to get results. Many rows match the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index:

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);

这是我的查询:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;


The test table on the production environment has been removed, so I show the EXPLAIN output from my test environment. There are about 7 million rows and it takes about 1.6 s to get results; with 300 million rows, it takes more than 30 s.

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                    QUERY PLAN                                                                            
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
 ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
     Sort Key: (similarity((address)::text, '781 maun st'::text))
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
           Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)


Is there a good way to improve the performance, or would table partitioning be a good plan?

Answer


PostgreSQL 9.3 ... Is there a good way to improve the performance, or would table partitioning be a good plan?

Table partitioning will not help at all.


But yes, there is a good way: upgrade to a current version of Postgres. There have been many improvements to GiST indexes, to the pg_trgm module in particular, and for big data in general. It should be substantially faster with Postgres 9.6 or the upcoming Postgres 10 (currently beta).


You "nearest neighbor" looks correct but for a small LIMIT use this equivalent query instead:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM   addresses 
WHERE  address % '981 maun st' 
ORDER  BY address <-> '981 maun st'
LIMIT  10;
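A related knob worth testing (an addition of mine, not part of the original answer): pg_trgm lets you raise the similarity threshold used by the % operator, which shrinks the candidate set the index scan has to produce. On Postgres 9.3 this is done with the set_limit() function; since Postgres 9.6 the pg_trgm.similarity_threshold setting can be used instead. A hedged sketch:

```sql
-- Raise the % threshold from the default 0.3 to 0.5 (tune to your data),
-- so fewer low-quality matches are fetched from the index.
SELECT set_limit(0.5);                        -- Postgres 9.3
-- SET pg_trgm.similarity_threshold = 0.5;    -- Postgres 9.6+

SELECT address, similarity(address, '981 maun st') AS sml
FROM   addresses
WHERE  address % '981 maun st'
ORDER  BY address <-> '981 maun st'
LIMIT  10;
```

Note this trades recall for speed: addresses with similarity between 0.3 and 0.5 no longer match at all, so benchmark the threshold against real queries before raising it.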

Quoting the manual:


It will usually beat the first formulation when only a small number of the closest matches is wanted.
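One more option to benchmark (my addition, under the assumption that the % filter dominates the cost): a trigram GIN index is often considerably faster than GiST for plain % searches. The caveat is that GIN cannot serve the ORDER BY address <-> ... distance ordering, so the top-N sort would run over all rows passing the % filter:

```sql
-- GIN trigram index: typically faster for the % operator on large tables,
-- but it does not support ordering by the <-> distance operator
-- (only GiST indexes can satisfy that ORDER BY directly).
CREATE INDEX addresses_trgm_gin_idx ON addresses USING gin (address gin_trgm_ops);
```

With only a few thousand rows passing the filter (as in the EXPLAIN output above), the extra top-N heapsort is cheap, so this can still come out ahead.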

