使用 PostgreSQL 快速查找相似字符串 [英] Finding similar strings with PostgreSQL quickly
问题描述
我需要在表中创建相似字符串的排名.
I need to create a ranking of similar strings in a table.
我有下表
create table names (
name character varying(255)
);
目前,我正在使用提供 similarity
功能的 pg_trgm 模块,但我遇到了效率问题.我创建了一个索引,如 Postgres 手册建议:
Currently, I'm using pg_trgm module which offers the similarity
function, but I have an efficiency problem. I created an index like the Postgres manual suggests:
CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);
我正在执行以下查询:
select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;
该查询有效,但当您有数百个名称时,速度确实很慢.此外,也许我忘记了一些 SQL,但我不明白为什么我不能使用条件 和 sim >.8
没有收到列 sim 不存在"错误.
The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8
without getting a "column sim doesn't exist" error.
我想要任何提示以加快查询速度.
I'd like any hint to make the query faster.
推荐答案
按照您的方式,必须计算表中每个元素与每个其他元素之间的相似性(几乎是交叉连接).如果您的表有 1000 行,那已经是 1,000,000 (!) 次相似性计算,之前可以根据条件检查并排序.缩放得非常厉害.
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
使用SET pg_trgm.similarity_threshold
和 %
运算符代替.两者都由 pg_trgm
模块提供.这样一来,三元组 GiST 索引就可以发挥很大的作用.
Use SET pg_trgm.similarity_threshold
and the %
operator instead. Both are provided by the pg_trgm
module. This way, a trigram GiST index can be used to great effect.
配置参数pg_trgm.similarity_threshold
替换了函数 set_limitPostgres 9.6 中的 ()
和 show_limit()
.不推荐使用的函数仍然有效(从 Postgres 13 开始).此外,自 Postgres 9.1 以来,GIN 和 GiST 索引的性能在许多方面都有所提升.
The configuration parameter pg_trgm.similarity_threshold
replaced the functions set_limit()
and show_limit()
in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
尝试:
SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
快了几个数量级,但仍然很慢.
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold
是一个 "定制"选项,可以像任何其他选项一样处理.见:
pg_trgm.similarity_threshold
is a "customized" option, which can be handled like any other option. See:
您可能希望通过在交叉连接之前添加先决条件(例如匹配首字母)来限制可能对的数量(并通过匹配的功能索引支持).交叉连接的性能随着O(N²)而下降.
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
此不起作用,因为您不能在WHERE
或HAVING
子句中引用输出列:
This does not work because you cannot refer to output columns in WHERE
or HAVING
clauses:
WHERE ... sim > 0.8
这是根据 SQL 标准(由某些其他 RDBMS 相当松散地处理).另一方面:
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
有效,因为输出列可以用于GROUP BY
和ORDER BY
.见:
Works because output columns can be used in GROUP BY
and ORDER BY
. See:
我在旧测试服务器上进行了快速测试以验证我的声明.
PostgreSQL 9.1.4.EXPLAIN ANALYZE
花费的时间(最好的 5 个).
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE
(best of 5).
CREATE TEMP table t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings
使用 GIN 索引进行第一轮测试:
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
使用 GIST 索引的第二轮测试:
Second round of tests with GIST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
新查询:
SELECT set_limit(0.8);
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
使用 GIN 索引,64 次命中:总运行时间:484.022 毫秒
使用 GIST 索引,64 次点击:总运行时间:248.772 毫秒
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
旧查询:
SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;
GIN 索引未使用,64 次点击:总运行时间:6345.833 毫秒
GIST 索引未使用,64 次点击:总运行时间:6335.975 毫秒
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
其他相同的结果.建议很好.这是仅 1000 行!
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN 通常提供卓越的读取性能:
GIN often provides superior read performance:
这可以通过 GiST 索引非常有效地实现,但不能通过GIN 索引.
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
- 具有异构的 3 个字段的多列索引数据类型
这篇关于使用 PostgreSQL 快速查找相似字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!