使用 PostgreSQL 快速查找相似字符串 [英] Finding similar strings with PostgreSQL quickly

查看:146
本文介绍了使用 PostgreSQL 快速查找相似字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要在表中创建相似字符串的排名.

I need to create a ranking of similar strings in a table.

我有下表

create table names (
name character varying(255)
);

目前,我正在使用提供 similarity 功能的 pg_trgm 模块,但我遇到了效率问题.我创建了一个索引,如 Postgres 手册建议:

Currently, I'm using pg_trgm module which offers the similarity function, but I have an efficiency problem. I created an index like the Postgres manual suggests:

CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);

我正在执行以下查询:

select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;

该查询有效,但当您有数百个名称时,速度确实很慢.此外,也许我忘记了一些 SQL,但我不明白为什么我不能使用条件 和 sim >.8 没有收到列 sim 不存在"错误.

The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 without getting a "column sim doesn't exist" error.

我想要任何提示以加快查询速度.

I'd like any hint to make the query faster.

推荐答案

按照您的方式,必须计算表中每个元素与每个其他元素之间的相似性(几乎是交叉连接).如果您的表有 1000 行,那已经是 1,000,000 (!) 次相似性计算,之前可以根据条件检查并排序.缩放得非常厉害.

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

使用SET pg_trgm.similarity_threshold% 运算符代替.两者都由 pg_trgm 模块提供.这样一来,三元组 GiST 索引就可以发挥很大的作用.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.

配置参数pg_trgm.similarity_threshold 替换了函数 set_limitPostgres 9.6 中的 ()show_limit().不推荐使用的函数仍然有效(从 Postgres 13 开始).此外,自 Postgres 9.1 以来,GIN 和 GiST 索引的性能在许多方面都有所提升.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

尝试:

SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later
  
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

快了几个数量级,但仍然很慢.

Faster by orders of magnitude, but still slow.

pg_trgm.similarity_threshold 是一个 "定制"选项,可以像任何其他选项一样处理.见:

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

您可能希望通过在交叉连接之前添加先决条件(例如匹配首字母)来限制可能对的数量(并通过匹配的功能索引支持).交叉连接的性能随着O(N²)而下降.

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

不起作用,因为您不能在WHEREHAVING 子句中引用输出列:

This does not work because you cannot refer to output columns in WHERE or HAVING clauses:

WHERE ... sim > 0.8

这是根据 SQL 标准(由某些其他 RDBMS 相当松散地处理).另一方面:

That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:

ORDER BY sim DESC

有效,因为输出列可以用于GROUP BYORDER BY.见:

Works because output columns can be used in GROUP BY and ORDER BY. See:

我在旧测试服务器上进行了快速测试以验证我的声明.
PostgreSQL 9.1.4.EXPLAIN ANALYZE 花费的时间(最好的 5 个).

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

使用 GIN 索引进行第一轮测试:

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

使用 GIST 索引的第二轮测试:

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

新查询:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

使用 GIN 索引,64 次命中:总运行时间:484.022 毫秒
使用 GIST 索引,64 次点击:总运行时间:248.772 毫秒

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

旧查询:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

GIN 索引使用,64 次点击:总运行时间:6345.833 毫秒
GIST 索引使用,64 次点击:总运行时间:6335.975 毫秒

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

其他相同的结果.建议很好.这是仅 1000 行

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN 通常提供卓越的读取性能:

GIN often provides superior read performance:

但不是在这种特殊情况下!

这可以通过 GiST 索引非常有效地实现,但不能通过GIN 索引.

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆