Similar UTF-8 strings for autocomplete field
Problem description
Users can type in a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters. This uses the pg_trgm module.
The code resembles the following:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)
I tried to implement an unaccent function and rewrite the query as:
SELECT
t.label
FROM
the_table t
WHERE
unaccent( label ) % unaccent( 'fil' )
ORDER BY
similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC
This returns only filbert.
As suggested:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;
CREATE OR REPLACE FUNCTION unaccent_text(text)
RETURNS text AS
$BODY$
SELECT unaccent($1);
$BODY$
LANGUAGE sql IMMUTABLE
COST 1;
All other indexes on the table have been dropped. Then:
CREATE INDEX label_unaccent_idx
ON the_table( lower( unaccent_text( label ) ) );
This returns only one result:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
Question
What is the best way to rewrite the query to ensure that both results are returned?
Thanks!
http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary
http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html
Recommended answer
You are not using the operator class provided by the pg_trgm module. I would create an index like this:
CREATE INDEX label_Lower_unaccent_trgm_idx
ON the_table USING gist (lower(unaccent_text(label)) gist_trgm_ops);
Originally, I had a GIN index here, but I later learned that a GiST index is probably even better suited for this kind of query because it can return values sorted by similarity. More details:
- Postgresql: Matching Patterns between Two Columns
- Finding similar strings with PostgreSQL quickly
Your query has to match the index expression to be able to make use of it.
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) % 'fil'
ORDER BY similarity(label, 'fil') DESC -- it's ok to use original string here
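One way to check that the planner really picks up the trigram index is EXPLAIN. This is a minimal sketch, assuming the GiST index above exists on the_table and the unaccent_text() wrapper from the question is installed. On a small table the planner may still prefer a sequential scan, so switching off enable_seqscan for the session is used here purely as a testing aid:

```sql
-- Testing aid only: discourage sequential scans so that index usage
-- becomes visible even on a tiny table. Reset it afterwards.
SET enable_seqscan = off;

EXPLAIN
SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil';  -- same expression as in the index

RESET enable_seqscan;
```

If the plan still shows a Seq Scan on the_table, the expression in the WHERE clause most likely differs from the indexed expression.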
However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect what you really want is this:
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) ~~ '%fil%'
ORDER BY similarity(label, 'fil') DESC -- it's ok to use original string here
This will find all strings containing the search string and sort the best matches first, ranked by similarity(), the same measure the % operator is based on.
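Since the question says the user input may contain accents as well, the search term probably needs the same normalization as the column. The following is a sketch under that assumption, with 'Filé' standing in as a hypothetical user input. Note that % and _ inside user input act as LIKE wildcards and would need escaping, which this sketch does not attempt:

```sql
SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label))
       LIKE '%' || lower(unaccent_text('Filé')) || '%'  -- 'Filé' is a placeholder for the user's input
ORDER  BY similarity(lower(unaccent_text(label)),
                     lower(unaccent_text('Filé'))) DESC;
```

Because unaccent_text() is declared IMMUTABLE, the right-hand side can be folded into a constant at planning time, so the left-hand side still matches the indexed expression.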
And the juicy part: the expression can use a GIN or GiST index since PostgreSQL 9.1! I quote the manual on the pg_trgm module:
Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example
If you actually meant to use the % operator:
Have you tried lowering the threshold for the similarity operator % with set_limit():
SELECT set_limit(0.1);
Or even lower? The default is 0.3. Just to see whether it's the threshold that filters out additional matches.
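To see whether the threshold is what drops a row, it can help to look at the raw scores without the % filter. A small sketch, assuming the table and the unaccent_text() wrapper from the question; show_limit(), similarity() and show_trgm() are all provided by pg_trgm:

```sql
-- Current threshold used by the % operator (0.3 unless changed with set_limit()).
SELECT show_limit();

-- Similarity of the search term to every label, without the % filter,
-- so rows that fall below the threshold are visible too.
SELECT label,
       similarity(lower(unaccent_text(label)), 'fil') AS sim
FROM   the_table
ORDER  BY sim DESC;

-- The trigrams pg_trgm extracts from the search term.
SELECT show_trgm('fil');
```

Any label whose sim is below the value reported by show_limit() is filtered out by %. Longer labels, multi-word ones in particular, contribute more trigrams of their own, which lowers the score against a short search term whether or not accents are involved.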