Does PostgreSQL support "accent insensitive" collations?

Question

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

SELECT * FROM users WHERE name LIKE 'João'

to find a row with the name Joao.

I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

Recommended answer

Use the unaccent module for that - which is completely different from what you are linking to.


unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database:

CREATE EXTENSION unaccent;

If you get an error like:


ERROR: could not open extension control file "/usr/share/postgresql/9.x/extension/unaccent.control": No such file or directory

Install the contrib package on your database server as instructed in this related answer:

  • Error when creating unaccent extension on PostgreSQL
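
Before installing anything, it can help to check whether the server already ships the extension files. A minimal sketch, using the standard pg_available_extensions system view:

```sql
-- Lists unaccent if its control file is present on the server.
-- installed_version stays NULL until CREATE EXTENSION has been run
-- in the current database.
SELECT name, default_version, installed_version
FROM   pg_available_extensions
WHERE  name = 'unaccent';
```

An empty result suggests the contrib package is not installed on the server.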

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');



Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:


  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily.
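
To see how the functions are declared on a given installation, the system catalog can be queried. This is only a sketch for inspection, and the output may differ between Postgres versions:

```sql
-- provolatile: 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE
SELECT p.oid::regprocedure AS function_signature
     , p.provolatile
FROM   pg_proc p
WHERE  p.proname = 'unaccent';
```

Both the one-parameter and the two-parameter variant should show up once the extension is installed.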

Some tutorials on the web suggest simply altering the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:

I propose an approach that is at least as efficient as other solutions floating around, but safer: Create a wrapper function with the two-parameter form and "hard-wire" the schema for function and dictionary:

CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE;

public being the schema where you installed the extension (public is the default).
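
If you are not sure which schema the extension lives in, the catalog can tell you. A minimal sketch:

```sql
-- Reports the schema that CREATE EXTENSION unaccent installed into.
SELECT e.extname
     , n.nspname AS extension_schema
FROM   pg_extension e
JOIN   pg_namespace n ON n.oid = e.extnamespace
WHERE  e.extname = 'unaccent';
```

If this reports a schema other than public, both public. qualifiers in f_unaccent() would need to be changed to match.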

Previously, I had added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too, which is currently (pg 10) not documented. This version is a bit shorter and around twice as fast in my tests on pg 9.5 and pg 10.

The updated version still doesn't allow function inlining because functions declared IMMUTABLE may not call non-immutable functions in the body to allow that. Hardly matters for performance while we make use of an expression index on this IMMUTABLE function:

CREATE INDEX users_unaccent_name_idx ON users(f_unaccent(name));

Adapt your queries to match the index (so the query planner can use it):

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

You don't need the function in the right expression. You can supply unaccented strings like 'Joao' directly.
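
To check whether the planner actually picks up the expression index, look at the query plan. A sketch, assuming the users table and the users_unaccent_name_idx index from above:

```sql
-- The left-hand expression must match the indexed expression exactly.
EXPLAIN
SELECT * FROM users
WHERE  f_unaccent(name) = 'Joao';
```

On a very small table the planner may still prefer a sequential scan, so test with realistic data volumes.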

In Postgres 9.5 or older, ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S

You will love this update to unaccent in Postgres 9.6:


Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss



Pattern matching

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Can be used for queries like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GIST indexes are more expensive to maintain than plain btree:

  • Difference between GiST and GIN index

There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:

  • Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
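
One common option for left-anchored patterns is a plain btree expression index with the text_pattern_ops operator class. This is an assumption about what "simpler solutions" refers to, and the index name is made up for illustration:

```sql
-- btree index usable for LIKE patterns anchored at the start of the string
CREATE INDEX users_unaccent_name_pattern_idx ON users
(f_unaccent(name) text_pattern_ops);

SELECT * FROM users
WHERE  f_unaccent(name) LIKE 'Joao%';
```

Such an index does not help with patterns that start with a wildcard, which is where the trigram index comes in.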

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
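
A rough sketch of how these operators might be combined with the wrapper function, assuming pg_trgm is installed and f_unaccent() is defined as above:

```sql
-- % filters by trigram similarity, <-> orders by trigram distance
SELECT name
     , similarity(f_unaccent(name), 'Joao') AS sml
FROM   users
WHERE  f_unaccent(name) % 'Joao'
ORDER  BY f_unaccent(name) <-> 'Joao'
LIMIT  10;
```

The % filter can use the trigram GIN index shown earlier. Index-assisted ordering by <-> is a feature of GiST trigram indexes, so with GIN only, the sort happens after filtering.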

Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:

  • PostgreSQL accent + case insensitive search
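
For illustration, queries of the following shape can be served by the trigram expression index from above. The search patterns themselves are made-up examples:

```sql
-- accent-insensitive and case-insensitive substring match
SELECT * FROM users
WHERE  f_unaccent(name) ILIKE '%joao%';

-- simple case-insensitive regular expression
SELECT * FROM users
WHERE  f_unaccent(name) ~* '^jo(a|e)o';
```

Since f_unaccent() only strips accents and does not fold case, supply the search term without accents and let ILIKE or ~* handle case.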
