PostgreSQL,找到的字符串相差n个字符 [英] PostgreSQL, find strings differ by n characters

查看:196
本文介绍了PostgreSQL,找到的字符串相差n个字符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有一个这样的表

id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002

sql小提琴演示

id是主键,data是固定长度的字符串,其中字符可以是0、1、2.

id is primary key, data is fixed length string where characters could be 0, 1, 2.

有没有一种建立索引的方法,因此我可以快速找到与给定字符串相差n个字符的字符串?像字符串0001n = 1一样,我想获得第6行.

Is there a way to build an index so I could quickly find strings which are differ by n characters from given string? like for string 0001 and n = 1 I want to get row 6.

谢谢.

推荐答案

提供Fuzzystrmatch .它完全符合您的要求:

There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:

SELECT *
FROM   a
WHERE  levenshtein(data, '1110') = 1;

SQL小提琴.

但这不是很快.大表比较慢,因为它不能使用索引.

But it is not very fast. Slow with big tables, because it can't use an index.

您可能会到达由附加模块

You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere, the module is using a different definition of "similarity".

通常,问题似乎适用于 KNN("k个最近的邻居")搜索模式.

Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.

如果您的情况与问题中的示例一样简单,则可以将LIKE与Trigram GIN索引结合使用,对于大型表,该索引应该相当快:

If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:

SELECT *
FROM   a
WHERE  data <> '1110'
AND   (data LIKE '_110' OR
       data LIKE '1_10' OR
       data LIKE '11_0' OR
       data LIKE '111_');

很明显,如果字符串更长且差异大于1,则该技术很快变得不可行.

Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.

但是,由于字符串是如此之短,因此任何查询都将匹配基本表的很大一部分.因此,索引支持几乎不会给您带来任何好处.在大多数情况下,Postgres顺序扫描会更快.

However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.

我测试了10k和100k行,带有和不带有Trigram GIN索引.由于〜19%符合给定测试用例的标准,因此顺序扫描更快,并且levenshtein()仍然获胜.对于匹配少于少于5%的行(取决于行)的更具选择性的查询,使用索引的查询要快得多.

I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.

这篇关于PostgreSQL,找到的字符串相差n个字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆