使用PostgreSQL快速查找相似的字符串 [英] Finding similar strings with PostgreSQL quickly
问题描述
我需要在表中创建相似字符串的排名.
I need to create a ranking of similar strings in a table.
我有下表
create table names (
name character varying(255)
);
当前,我正在使用提供similarity
功能的 pg_trgm 模块,但是我遇到了效率问题.我创建了一个 Postgres手册建议:
Currently, I'm using pg_trgm module which offers the similarity
function, but I have an efficiency problem. I created an index like the Postgres manual suggests:
CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);
我正在执行以下查询:
select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;
查询有效,但是当您有数百个名称时,查询速度确实很慢.此外,也许我忘记了一些SQL,但我不明白为什么为什么在没有出现列sim不存在"错误的情况下不能使用条件and sim > .8
的问题.
The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8
without getting a "column sim doesn't exist" error.
我希望有任何提示可以使查询更快.
I'd like any hint to make the query faster.
推荐答案
使用它的方式,必须计算表中每个元素与每个其他元素之间的相似度(几乎是交叉联接).如果您的表有1000行,则已经有1,000,000(!)个相似度计算,可以根据条件在之前进行检查并进行排序.可怕地缩放.
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
使用 SET pg_trgm.similarity_threshold
和 %
运算符.两者均由 pg_trgm
模块提供.
Use SET pg_trgm.similarity_threshold
and the %
operator instead. Both are provided by the pg_trgm
module.
配置参数 pg_trgm.similarity_threshold
替换了功能 set_limit()
和show_limit()
在Postgres 9.6中.不推荐使用的功能仍然有效(自Postgres 12起).此外,自Postgres 9.1以来,GIN和GiST索引的性能在许多方面得到了改善.
The configuration parameter pg_trgm.similarity_threshold
replaced the functions set_limit()
and show_limit()
in Postgres 9.6. The deprecated functions still work (as of Postgres 12). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
请尝试:
SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
-- SELECT set_limit(0.8); -- for older versions
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
更快,但仍然很慢.
pg_trgm.similarity_threshold
是自定义"选项 ,可以像其他任何选项一样进行处理.参见:
pg_trgm.similarity_threshold
is a "customized" option, which can be handled like any other option. See:
您可能想通过在交叉连接之前 添加前提条件(例如匹配首字母)来限制可能的对数(并使用匹配的功能索引来支持). 交叉连接的性能随着 O(N²)而变差.
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
关于您的附属问题:
WHERE ... sim > 0.8
不起作用,因为您不能引用WHERE
或HAVING
子句中的输出列.这是根据(有点让人困惑的)SQL标准制定的,该标准由某些其他RDBMS相当宽松地处理.
Does not work because you cannot refer to output columns in WHERE
or HAVING
clauses. That's according to the (a bit confusing, granted) SQL standard - which is handled rather loosely by certain other RDBMS.
另一方面:
ORDER BY sim DESC
有效,因为可以在GROUP BY
和ORDER BY
中使用输出列 .详细信息:
Works because output columns can be used in GROUP BY
and ORDER BY
. Details:
我在旧的测试服务器上进行了快速测试,以验证我的主张.
PostgreSQL 9.1.4.用EXPLAIN ANALYZE
拍摄的时间(最好5个).
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE
(best of 5).
CREATE TEMP table t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings
具有GIN索引的第一轮测试:
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
具有GIST索引的第二轮测试:
Second round of tests with GIST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
新查询:
SELECT set_limit(0.8);
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
已使用GIN索引,命中64次:总运行时间:484.022毫秒
已使用GIST索引,命中64次:总运行时间: 248.772毫秒
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
旧查询:
SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;
使用了GIN索引未,点击64次:总运行时间:6345.833毫秒
已使用GIST索引未,命中64次:总运行时间: 6335.975毫秒
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
否则结果相同.忠告是好的.这是用于仅1000行!
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN通常提供卓越的读取性能:
GIN often provides superior read performance:
这可以通过GiST索引非常有效地实现,但不能通过 GIN索引.
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
- 3个具有异质字段的多列索引数据类型
- Multicolumn index on 3 fields with heterogenous data types
这篇关于使用PostgreSQL快速查找相似的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!