使用PostgreSQL快速查找相似的字符串 [英] Finding similar strings with PostgreSQL quickly

查看：1196 发布时间：2020/5/29 19:45:48 sql postgresql text similarity postgresql-performance

本文介绍了使用PostgreSQL快速查找相似的字符串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要在表中创建相似字符串的排名.

I need to create a ranking of similar strings in a table.

我有下表

create table names (
name character varying(255)
);

当前，我正在使用提供similarity功能的 pg_trgm 模块，但是我遇到了效率问题.我创建了一个 Postgres手册建议:

Currently, I'm using pg_trgm module which offers the similarity function, but I have an efficiency problem. I created an index like the Postgres manual suggests:

CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);

我正在执行以下查询:

select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
from names n1, names n2
where n1.name != n2.name and similarity(n1.name, n2.name) > .8
order by sim desc;

查询有效，但是当您有数百个名称时，查询速度确实很慢.此外，也许我忘记了一些SQL，但我不明白为什么为什么在没有出现列sim不存在"错误的情况下不能使用条件and sim > .8的问题.

The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 without getting a "column sim doesn't exist" error.

我希望有任何提示可以使查询更快.

I'd like any hint to make the query faster.

推荐答案

使用它的方式，必须计算表中每个元素与每个其他元素之间的相似度(几乎是交叉联接).如果您的表有1000行，则已经有1,000,000(！)个相似度计算，可以根据条件在之前进行检查并进行排序.可怕地缩放.

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

使用 SET pg_trgm.similarity_threshold 和 % 运算符.两者均由 pg_trgm 模块提供.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module.

配置参数 pg_trgm.similarity_threshold 替换了功能 set_limit()和show_limit() 在Postgres 9.6中.不推荐使用的功能仍然有效(自Postgres 12起).此外，自Postgres 9.1以来，GIN和GiST索引的性能在许多方面得到了改善.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 12). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

请尝试:

SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later -- SELECT set_limit(0.8); -- for older versions SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name FROM names n1 JOIN names n2 ON n1.name <> n2.name AND n1.name % n2.name ORDER BY sim DESC;

更快，但仍然很慢.

pg_trgm.similarity_threshold是自定义"选项，可以像其他任何选项一样进行处理.参见:

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

查询参数(postgresql.conf设置)，例如"max_connections"

您可能想通过在交叉连接之前添加前提条件(例如匹配首字母)来限制可能的对数(并使用匹配的功能索引来支持). 交叉连接的性能随着 O(N²)而变差.

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

关于您的附属问题:

WHERE ... sim > 0.8

不起作用，因为您不能引用WHERE或HAVING子句中的输出列.这是根据(有点让人困惑的)SQL标准制定的，该标准由某些其他RDBMS相当宽松地处理.

Does not work because you cannot refer to output columns in WHERE or HAVING clauses. That's according to the (a bit confusing, granted) SQL standard - which is handled rather loosely by certain other RDBMS.

另一方面:

ORDER BY sim DESC

有效，因为可以在GROUP BY和ORDER BY中使用输出列 .详细信息:

Works because output columns can be used in GROUP BY and ORDER BY. Details:

PostgreSQL在选择查询中重用计算结果

我在旧的测试服务器上进行了快速测试，以验证我的主张.
PostgreSQL 9.1.4.用EXPLAIN ANALYZE拍摄的时间(最好5个).

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings

具有GIN索引的第一轮测试:

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index

具有GIST索引的第二轮测试:

Second round of tests with GIST index:

DROP INDEX t_gin; CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

新查询:

SELECT set_limit(0.8); SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name FROM t n1 JOIN t n2 ON n1.name <> n2.name AND n1.name % n2.name ORDER BY sim DESC;

已使用GIN索引，命中64次:总运行时间:484.022毫秒
已使用GIST索引，命中64次:总运行时间: 248.772毫秒

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

旧查询:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name FROM t n1, t n2 WHERE n1.name != n2.name AND similarity(n1.name, n2.name) > 0.8 ORDER BY sim DESC;

使用了GIN索引未，点击64次:总运行时间:6345.833毫秒
已使用GIST索引未，命中64次:总运行时间: 6335.975毫秒

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

否则结果相同.忠告是好的.这是用于仅1000行！

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN通常提供卓越的读取性能:

GIN often provides superior read performance:

GiST和GIN索引之间的差异

但在这种情况下不是这样的:

这可以通过GiST索引非常有效地实现，但不能通过 GIN索引.

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

3个具有异质字段的多列索引数据类型

Multicolumn index on 3 fields with heterogenous data types

这篇关于使用PostgreSQL快速查找相似的字符串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用PostgreSQL快速查找相似的字符串 [英] Finding similar strings with PostgreSQL quickly

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用PostgreSQL快速查找相似的字符串 [英] Finding similar strings with PostgreSQL quickly

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭