Similar UTF-8 strings for autocomplete field

Views: 110 Published: 2020/5/28 18:56:44 Tags: postgresql utf-8 plpgsql string-comparison similarity

Problem description

Users can type in a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters. This uses the pg_trgm module.

The code resembles the following:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)

I tried to implement an unaccent function and rewrite the query as:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    unaccent( label ) % unaccent( 'fil' )
  ORDER BY
    similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC

This returns only filbert.

As suggested:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;

CREATE OR REPLACE FUNCTION unaccent_text(text)
  RETURNS text AS
$BODY$
  SELECT unaccent($1); 
$BODY$
  LANGUAGE sql IMMUTABLE
  COST 1;

All other indexes on the table have been dropped. Then:

CREATE INDEX label_unaccent_idx 
ON the_table( lower( unaccent_text( label ) ) );

This returns only one result:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

Question

What is the best way to rewrite the query to ensure that both results are returned?

Thank you!

http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary

http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html

Recommended answer

You are not using the operator class provided by the pg_trgm module. I would create an index like this:


CREATE INDEX label_lower_unaccent_trgm_idx
ON the_table USING gist (lower(unaccent_text(label)) gist_trgm_ops);

Originally, I had a GIN index here, but I later learned that a GiST index is probably even better suited for this kind of query because it can return values sorted by similarity. More details:

Postgresql: Matching Patterns between Two Columns
Finding similar strings with PostgreSQL quickly
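
The sorted-by-similarity behaviour mentioned above can be sketched with the pg_trgm distance operator <->. This is an illustrative variant rather than part of the original answer; it assumes a GiST index with gist_trgm_ops exists on the same expression on the_table, and the LIMIT of 10 is an arbitrary choice for an autocomplete list:

```sql
-- <-> returns the trigram distance (1 - similarity); smaller means closer.
-- A gist_trgm_ops index on lower(unaccent_text(label)) can serve this
-- ORDER BY directly, so only the nearest rows need to be fetched.
SELECT label,
       lower(unaccent_text(label)) <-> 'fil' AS distance
FROM   the_table
ORDER  BY lower(unaccent_text(label)) <-> 'fil'
LIMIT  10;
```

Note that this form has no similarity threshold at all: it simply returns the closest labels, however distant they are.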

Your query has to match the index expression to be able to make use of it.

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here
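
Whether the planner actually picks up the index can be checked with EXPLAIN. A minimal sketch, assuming the expression index above exists on the_table; disabling sequential scans is only a testing aid for small tables, where the planner may otherwise prefer a sequential scan:

```sql
-- Testing aid only: discourage sequential scans for this session so the
-- plan shows whether the expression index is usable at all.
SET enable_seqscan = off;

EXPLAIN
SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil';

-- Restore the default planner behaviour.
RESET enable_seqscan;
```

If the plan still shows a sequential scan, the WHERE expression most likely does not match the indexed expression exactly.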

However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect what you really want is this:


SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) ~~ '%fil%'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here

This finds all strings containing the search string and sorts the best matches first, ranked by similarity().
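
For real user input, the search term probably needs the same normalization as the column, and any LIKE wildcards typed by the user should be escaped. The following is a sketch under assumptions, not part of the original answer: label_search is a hypothetical statement name, and the escaping relies on standard_conforming_strings being on (the default since PostgreSQL 9.1):

```sql
-- Hypothetical prepared statement; label_search is an illustrative name.
-- Assumes standard_conforming_strings = on, so '\' is one literal
-- backslash, which is also the default escape character for LIKE.
PREPARE label_search(text) AS
SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) LIKE
       '%' || replace(replace(replace(
                lower(unaccent_text($1)),
                '\', '\\'),   -- escape the escape character first
                '%', '\%'),   -- then the LIKE wildcards
                '_', '\_') || '%'
ORDER  BY similarity(lower(unaccent_text(label)),
                     lower(unaccent_text($1))) DESC;

-- The input is lowercased and unaccented before it is compared.
EXECUTE label_search('Filé');
```

Normalizing both sides keeps the left-hand expression identical to the indexed one while letting accented or mixed-case input match.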

And the juicy part: the expression can use a GIN or GiST index since PostgreSQL 9.1! I quote the manual on the pg_trgm module:

Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example

If you actually meant to use the % operator:

Have you tried lowering the threshold for the similarity operator % with set_limit():

SELECT set_limit(0.1);

or even lower? The default is 0.3. Just to see whether it's the threshold that filters additional matches.
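
One way to see whether the threshold is the culprit is to list each candidate's score next to the current limit. A minimal sketch, assuming the unaccent_text wrapper from the question; the LIKE filter is only there to restrict the output to labels that contain the input:

```sql
-- Show the similarity score of each candidate beside the active threshold.
-- Rows whose score is below the threshold are the ones % filters out.
SELECT label,
       similarity(lower(unaccent_text(label)), 'fil') AS score,
       show_limit()                                   AS threshold
FROM   the_table
WHERE  lower(unaccent_text(label)) LIKE '%fil%'
ORDER  BY score DESC;
```

Because the score is the share of trigrams the two strings have in common, longer labels tend to score lower against a three-letter input, whether or not they contain accents.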
