Similar UTF-8 strings for autocomplete field


Problem description

Users can type in a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters. This is using the pg_trgm module.

The code resembles the following:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)

I tried to implement an unaccent function and rewrite the query as:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    unaccent( label ) % unaccent( 'fil' )
  ORDER BY
    similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC

This returns only filbert.

Following a suggestion:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;

CREATE OR REPLACE FUNCTION unaccent_text(text)
  RETURNS text AS
$BODY$
  SELECT unaccent($1); 
$BODY$
  LANGUAGE sql IMMUTABLE
  COST 1;

All other indexes on the table have been dropped. Then:

CREATE INDEX label_unaccent_idx 
ON the_table( lower( unaccent_text( label ) ) );

This returns only one result:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

Question

What is the best way to rewrite the query to ensure that both results are returned?

Thanks!

http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary

http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html

Recommended answer

You are not using the operator class provided by the pg_trgm module. I would create an index like this:


CREATE INDEX label_Lower_unaccent_trgm_idx
ON the_table USING gist (lower(unaccent_text(label)) gist_trgm_ops);

Originally, I had a GIN index here, but I later learned that a GiST index is probably even better suited for this kind of query because it can return values sorted by similarity. More details:

  • Postgresql: Matching Patterns between Two Columns
  • Finding similar strings with PostgreSQL quickly
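
To illustrate what "sorted by similarity" can look like in practice, here is a minimal sketch. It assumes PostgreSQL 9.1 or later, where pg_trgm provides the distance operator <-> (one minus the similarity), and it assumes a GiST trigram index on lower(unaccent_text(label)) like the one above. The LIMIT of 10 is an arbitrary choice for an autocomplete list.

```sql
-- Sketch: nearest-neighbour ordering by trigram distance.
-- The ORDER BY expression repeats the indexed expression exactly,
-- so a GiST trigram index on it is a candidate for the sort.
SELECT label
FROM   the_table
ORDER  BY lower(unaccent_text(label)) <-> 'fil'
LIMIT  10;
```

Whether the planner actually picks the index depends on table size and statistics, so it is worth confirming with EXPLAIN.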

Your query has to match the index expression to be able to make use of it.

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here

However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect what you really want is this:


SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) ~~ '%fil%'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here

This will find all strings containing the search string, and sort the best matches first, ranked by similarity() (the same measure the % operator is based on).

And the juicy part: the expression can use a GIN or GiST index since PostgreSQL 9.1! I quote the manual on the pg_trgm module:

Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example
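
An example of such a search, adapted to the table in this question (a sketch, not the manual's own text). It assumes the trigram index on lower(unaccent_text(label)) shown earlier already exists. EXPLAIN is added only to check whether the index is chosen.

```sql
-- Sketch: a LIKE pattern search on the indexed expression.
-- The left-hand side matches the index expression exactly;
-- the search term is expected to be lowercased and unaccented already.
EXPLAIN
SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) LIKE '%fil%';
```

On a very small table the planner may still prefer a sequential scan, so a plan without the index does not by itself mean the index is unusable.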


If you actually meant to use the % operator:

Have you tried lowering the threshold for the similarity operator % with set_limit():

SELECT set_limit(0.1);

or even lower? The default is 0.3. Just to see whether it's the threshold that filters additional matches.
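
One way to test that is to look at the scores directly. The sketch below uses only pg_trgm functions (similarity, show_trgm, show_limit); the literal 'file powder' stands in for the unaccented label and is an assumption for illustration.

```sql
-- Sketch: compare the scores against the current threshold.
SELECT similarity('fil', 'filbert')     AS filbert_score,
       similarity('fil', 'file powder') AS file_powder_score,
       show_limit()                     AS current_threshold;

-- Inspect the trigrams pg_trgm extracts from the search term.
SELECT show_trgm('fil');

-- On PostgreSQL 9.6 and later the threshold is also available as a setting.
SET pg_trgm.similarity_threshold = 0.2;
```

Working the trigram sets out by hand (not run here), 'fil' yields 4 trigrams and shares 3 of them with each label. 'filbert' has 8 trigrams, giving 3/9, roughly 0.33, which is just above the default threshold. 'file powder' has 12, giving 3/13, roughly 0.23, which is below it. That would explain why only filbert is returned, with or without the accent.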
