使用PostgreSQL快速查找相似的字符串 [英] Finding similar strings with PostgreSQL quickly

查看:1196
本文介绍了使用PostgreSQL快速查找相似的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要在表中创建相似字符串的排名.

I need to create a ranking of similar strings in a table.

我有下表

create table names (
name character varying(255)
);

当前,我正在使用提供similarity功能的 pg_trgm 模块,但是我遇到了效率问题.我创建了一个 Postgres手册建议:

Currently, I'm using pg_trgm module which offers the similarity function, but I have an efficiency problem. I created an index like the Postgres manual suggests:

CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);

我正在执行以下查询:

select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;

查询有效,但是当您有数百个名称时,查询速度确实很慢.此外,也许我忘记了一些SQL,但我不明白为什么为什么在没有出现列sim不存在"错误的情况下不能使用条件and sim > .8的问题.

The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 without getting a "column sim doesn't exist" error.

我希望有任何提示可以使查询更快.

I'd like any hint to make the query faster.

推荐答案

使用它的方式,必须计算表中每个元素与每个其他元素之间的相似度(几乎是交叉联接).如果您的表有1000行,则已经有1,000,000(!)个相似度计算,可以根据条件在之前进行检查并进行排序.可怕地缩放.

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

使用 SET pg_trgm.similarity_threshold % 运算符.两者均由 pg_trgm 模块提供.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module.

配置参数 pg_trgm.similarity_threshold 替换了功能 set_limit()show_limit() 在Postgres 9.6中.不推荐使用的功能仍然有效(自Postgres 12起).此外,自Postgres 9.1以来,GIN和GiST索引的性能在许多方面得到了改善.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 12). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

请尝试:

SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
-- SELECT set_limit(0.8);               -- for older versions

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

更快,但仍然很慢.

pg_trgm.similarity_threshold自定义"选项 ,可以像其他任何选项一样进行处理.参见:

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

您可能想通过在交叉连接之前 添加前提条件(例如匹配首字母)来限制可能的对数(并使用匹配的功能索引来支持). 交叉连接的性能随着 O(N²)而变差.

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

关于您的附属问题:

WHERE ... sim > 0.8

不起作用,因为您不能引用WHEREHAVING子句中的输出列.这是根据(有点让人困惑的)SQL标准制定的,该标准由某些其他RDBMS相当宽松地处理.

Does not work because you cannot refer to output columns in WHERE or HAVING clauses. That's according to the (a bit confusing, granted) SQL standard - which is handled rather loosely by certain other RDBMS.

另一方面:

ORDER BY sim DESC

有效,因为可以在GROUP BYORDER BY中使用输出列 .详细信息:

Works because output columns can be used in GROUP BY and ORDER BY. Details:

我在旧的测试服务器上进行了快速测试,以验证我的主张.
PostgreSQL 9.1.4.用EXPLAIN ANALYZE拍摄的时间(最好5个).

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

具有GIN索引的第一轮测试:

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

具有GIST索引的第二轮测试:

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

新查询:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

已使用GIN索引,命中64次:总运行时间:484.022毫秒
已使用GIST索引,命中64次:总运行时间: 248.772毫秒

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

旧查询:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

使用了GIN索引,点击64次:总运行时间:6345.833毫秒
已使用GIST索引,命中64次:总运行时间: 6335.975毫秒

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

否则结果相同.忠告是好的.这是用于仅1000行

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN通常提供卓越的读取性能:

GIN often provides superior read performance:

但在这种情况下不是这样的:

这可以通过GiST索引非常有效地实现,但不能通过 GIN索引.

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

  • 3个具有异质字段的多列索引数据类型
    • Multicolumn index on 3 fields with heterogenous data types
    • 这篇关于使用PostgreSQL快速查找相似的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆