Optimizing a postgres similarity query (pg_trgm + gin index)


Question

I have defined the following index:

CREATE INDEX
    users_search_idx
ON
    auth_user
USING
    gin(
        username gin_trgm_ops,
        first_name gin_trgm_ops,
        last_name gin_trgm_ops
    );
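
The gin_trgm_ops operator class comes from the pg_trgm extension, so something like this has to run once per database beforehand (a minimal sketch, assuming the extension isn't installed yet):

CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- provides gin_trgm_ops, the % operator and similarity()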

I am executing the following query:

PREPARE user_search (TEXT, INT) AS
    SELECT
        username,
        email,
        first_name,
        last_name,
        ( -- would probably do per-field weightings here
            s_username + s_first_name + s_last_name
        ) rank
    FROM
        auth_user,
        similarity(username, $1) s_username,
        similarity(first_name, $1) s_first_name,
        similarity(last_name, $1) s_last_name
    WHERE
        username % $1 OR
        first_name % $1 OR
        last_name % $1
    ORDER BY
        rank DESC
    LIMIT $2;

The auth_user table has 6.2 million rows.

The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity query.

Increasing the similarity threshold via set_limit helps, but reduces usefulness of results by eliminating partial matches.
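
For reference, that threshold can be inspected and changed per session with pg_trgm's helper functions (a sketch; 0.5 is only an example value):

SELECT show_limit();    -- current similarity threshold used by %, default 0.3
SELECT set_limit(0.5);  -- raise the threshold for this session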

Some searches return in 200ms, others take ~ 10 seconds.

We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.

I would like to know if there is any way we could improve this to get more consistent performance?

It's my understanding that a GIN index (inverted index) is the same basic approach used by Elasticsearch, so I would have thought some optimization is possible.

An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:

Limit  (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
  ->  Sort  (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
        Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Nested Loop  (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
              ->  Nested Loop  (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
                    ->  Nested Loop  (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
                          ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
                                Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
                                Rows Removed by Index Recheck: 2434523
                                Heap Blocks: exact=49337 lossy=53104
                                ->  BitmapOr  (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
                                            Index Cond: ((username)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
                                            Index Cond: ((first_name)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
                                            Index Cond: ((last_name)::text % 'mel'::text)
                          ->  Function Scan on similarity s_username  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
                    ->  Function Scan on similarity s_first_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
              ->  Function Scan on similarity s_last_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms

Server is Postgres 9.6.1 running on Amazon RDS

Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com

So I tried:

-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)

This made a big improvement (previously > 10s)!
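
If the larger work_mem proves reliable, it doesn't have to be set by hand in every session (a sketch; app_user is a hypothetical application role):

SET work_mem = '12MB';                      -- per session, as tested above
ALTER ROLE app_user SET work_mem = '12MB';  -- hypothetical role name: persists across sessions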

1.5s is still a lot slower than ES for similar queries, so I would still like to hear any suggestions for optimising the query.

In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same setup with a GiST index in place of the GIN one.
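
The GiST index definition isn't shown here; presumably it mirrors the GIN one with the gist_trgm_ops operator class, roughly like this sketch (index name invented):

CREATE INDEX users_search_gist_idx ON auth_user  -- hypothetical name
USING gist (
    username   gist_trgm_ops,
    first_name gist_trgm_ops,
    last_name  gist_trgm_ops
);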

Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'. Increasing work_mem made no difference.

From this I conclude that the GiST index is more memory-efficient (it did not hit a pathological case like GIN did) but is slower than GIN when GIN is working properly. This is in line with what's described in the docs recommending the GIN index.

I still don't understand why so much time is spent in:

 ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
     Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
     Rows Removed by Index Recheck: 2434523
     Heap Blocks: exact=49337 lossy=53104

I don't understand why this step is needed or what it's doing.

There are three Bitmap Index Scan nodes beneath it, one for each of the username % $1, first_name % $1 and last_name % $1 clauses; these results then get combined in a BitmapOr step. These parts are all quite fast.

But even in the case where we don't run out of work mem, we still spend nearly a whole second in Bitmap Heap Scan.

Answer

I expect much faster results with this approach:

Create a GiST index with 1 column holding concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);

This assumes all 3 columns to be defined NOT NULL (you did not specify). Else you need to do more.
Why not simplify with concat_ws()?
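
For example, if the columns can be NULL, one way to "do more" is to fold NULLs to empty strings inside the index expression (a sketch; the query's WHERE and ORDER BY expressions must then use the exact same form):

CREATE INDEX users_search_idx ON auth_user  -- alternative, NULL-safe definition of the same index
USING gist ((coalesce(username, '') || ' ' || coalesce(first_name, '')
          || ' ' || coalesce(last_name, '')) gist_trgm_ops);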

Use a proper nearest-neighbor query, matching the above index:

SELECT username, email, first_name, last_name
     , similarity(username  , $1) AS s_username
     , similarity(first_name, $1) AS s_first_name
     , similarity(last_name , $1) AS s_last_name
     , row_number() OVER () AS rank  -- greatest similarity first
FROM   auth_user
WHERE     (username || ' ' || first_name || ' ' || last_name) %   $1  -- !!
ORDER  BY (username || ' ' || first_name || ' ' || last_name) <-> $1  -- !!
LIMIT  $2;

Expressions in WHERE and ORDER BY must match index expression!

In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.

row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.
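
If per-field weightings are still wanted (as hinted in the original query), one option in line with this approach is to let the nearest-neighbor query pick the candidates and re-rank only those few rows in an outer query (a sketch with arbitrary example weights, not part of the original answer):

SELECT *
FROM  (
   SELECT username, email, first_name, last_name
        , similarity(username  , $1) AS s_username
        , similarity(first_name, $1) AS s_first_name
        , similarity(last_name , $1) AS s_last_name
   FROM   auth_user
   WHERE     (username || ' ' || first_name || ' ' || last_name) %   $1
   ORDER  BY (username || ' ' || first_name || ' ' || last_name) <-> $1
   LIMIT  $2
   ) sub
ORDER  BY 2 * s_username + s_first_name + s_last_name DESC;  -- example weights only

Note this only reorders the rows the <-> ordering already selected; the expensive ranking expression is never evaluated for the millions of other rows.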

As for your item 3., I added an answer to the question you referenced, which should explain it.
