PostgreSQL使用pg_trgm慢然后进行全面扫描 [英] PostgreSQL using pg_trgm slower then full scan

查看:95
本文介绍了PostgreSQL使用pg_trgm慢然后进行全面扫描的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在玩 pg_trgm 扩展名,我有些困惑。这是会话:

I playing with pg_trgm extension and I am confused a bit. Here is the session:

postgres=# create table t(i int, x text);
CREATE TABLE
postgres=# insert into t select i, random()::text from generate_series(1,50000000) as i;
INSERT 0 50000000
postgres=# explain analyze select * from t where x ilike '%666666%';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..531870.29 rows=12954 width=36) (actual time=131.436..11408.176 rows=432 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..529574.89 rows=5398 width=36) (actual time=108.771..11304.946 rows=144 loops=3)
         Filter: (x ~~* '%666666%'::text)
         Rows Removed by Filter: 16666523
 Planning Time: 0.121 ms
 Execution Time: 11408.279 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..580654.94 rows=5000 width=21) (actual time=124.986..11070.983 rows=432 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..579154.94 rows=2083 width=21) (actual time=72.207..11010.876 rows=144 loops=3)
         Filter: (x ~~* '%666666%'::text)
         Rows Removed by Filter: 16666523
 Planning Time: 0.283 ms
 Execution Time: 11071.065 ms
(8 rows)

postgres=# create index i on t using gin (x gin_trgm_ops);
CREATE INDEX
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where x ilike '%666666%';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.93 rows=5000 width=21) (actual time=116.114..26995.773 rows=432 loops=1)
   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594
   ->  Bitmap Index Scan on i  (cost=0.00..53.50 rows=5000 width=0) (actual time=75.363..75.363 rows=592216 loops=1)
         Index Cond: (x ~~* '%666666%'::text)
 Planning Time: 0.389 ms
 Execution Time: 26996.429 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.93 rows=5000 width=21) (actual time=128.859..29231.765 rows=432 loops=1)
   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594
   ->  Bitmap Index Scan on i  (cost=0.00..53.50 rows=5000 width=0) (actual time=79.147..79.147 rows=592216 loops=1)
         Index Cond: (x ~~* '%666666%'::text)
 Planning Time: 0.252 ms
 Execution Time: 29231.945 ms
(8 rows)

如您所见,没有索引查询的速度是索引的两倍。只是现在有默认的PostgreSQL设置(共享缓冲区,工作内存等)

As you can see that without index the query is more then two times faster then with index. Just for now there are default PostgreSQL settings (shared buffers, work memory etc)

我错过了什么?

PS:x86_64-pc-linux-gnu上的PostgreSQL 11.5(Ubuntu 11.5-1.pgdg18.04 + 1),由gcc(Ubuntu 7.4.0-1ubuntu1〜18.04.1)编译7.4.0,64位

PS: PostgreSQL 11.5 (Ubuntu 11.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

PPS:使用要点索引会更慢。

PPS: Using gist index it is even slower.

推荐答案

tldr:三元组可能不适合搜索由重复N次的单个字符组成的模式(例如 666666 ),因为仅存在1个非终止三元组,并且可能在搜索空间中出现率很高。

tldr: trigrams may not good at searching for patterns consisting of a single character repeated N times (such as 666666) because there exists only 1 non-terminal trigram and that could have a high occurrence in the search space.

使用gin-index时,由于行太大而无法容纳在内存中,因此它存储对页面的引用,并且数据库必须对这些页面执行进一步的重新检查扫描。如果重新检查的页数很少,则使用索引仍然是有益的,但是,如果重新检查的页数较高,则索引的性能会很差。在您的说明输出中,以下行突出显示了这一点

When the gin-index is used, the bitmap of rows is too large to fit in memory, so instead it stores reference to pages, and the database has to perform a further recheck scan over these pages. If the number of rechecked pages is small, the index-use is still beneficial, however with high number of recheck pages the index performs poorly. This is highlighted by the following lines in your explain output

   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594

对于测试数据,该问题特定于您的搜索字符串,即 666666

The issue is particular to your search string, i.e. 666666, with respect to the test data.

如果您运行 select pg_trgm('666666'),则会发现:

if you run select pg_trgm('666666'), you will find:

        show_trgm        
-------------------------
 {"  6"," 66","66 ",666}
(1 row)

前3个三连字母甚至不会在一个相似的上下文(用户 jjanes 建议的更正)。在索引上搜索将产生所有包含 666 的页面。您可以通过运行带有的说明分析查询来验证这一点...我喜欢'%666%',并获得相同的堆块输出如上。

The first 3 trigrams will not even be generated in an ilike context (correction suggested by user jjanes). Searching on the index yields all pages containing 666. You can validate this by running the explain analyze query with ... ilike '%666%', and obtaining the same Heap Blocks output as above.

如果使用模式 123456 搜索,您会发现它的性能要好得多,因为它会生成较大的三词组来搜索:

if you search with the pattern 123456, you will see it performs much better, because it generates a larger set of trigrams to search against:

              show_trgm              
-------------------------------------
 {"  1"," 12",123,234,345,456,"56 "}
(1 row)

在我的机器上,我得到以下信息:

On my machine, I get the following:

|------------------------------------|
| pattern | pages rechecked          |
|         | exact | lossy  | total   |
|------------------------------------|
| 123456  |   600 |        |    600  |
| 666666  | 39454 | 230592 | 270046* |
|    666  | 39454 | 230592 | 270046* |
|------------------------------------|
*this is rougly 85% of the total # of pages used for the table 't'

这是解释输出:

postgres=> explain analyze select * from t where x ~ '123456';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=90.75..18143.92 rows=5000 width=22) (actual time=110.962..113.509 rows=518 loops=1)
   Recheck Cond: (x ~ '123456'::text)
   Rows Removed by Index Recheck: 83
   Heap Blocks: exact=600
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..89.50 rows=5000 width=0) (actual time=110.868..110.868 rows=601 loops=1)
         Index Cond: (x ~ '123456'::text)
 Planning time: 0.703 ms
 Execution time: 113.564 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666666';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.92 rows=5000 width=22) (actual time=137.143..18111.609 rows=462 loops=1)
   Recheck Cond: (x ~ '666666'::text)
   Rows Removed by Index Recheck: 36258389
   Heap Blocks: exact=39454 lossy=230592
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..53.50 rows=5000 width=0) (actual time=105.962..105.962 rows=593708 loops=1)
         Index Cond: (x ~ '666666'::text)
 Planning time: 0.420 ms
 Execution time: 18111.739 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.92 rows=5000 width=22) (actual time=102.813..17285.086 rows=593708 loops=1)
   Recheck Cond: (x ~ '666'::text)
   Rows Removed by Index Recheck: 35665143
   Heap Blocks: exact=39454 lossy=230592
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..53.50 rows=5000 width=0) (actual time=96.100..96.100 rows=593708 loops=1)
         Index Cond: (x ~ '666'::text)
 Planning time: 0.500 ms
 Execution time: 17300.440 ms
(8 rows)

这篇关于PostgreSQL使用pg_trgm慢然后进行全面扫描的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆