PostgreSQL 是否支持“重音不敏感"?整理? [英] Does PostgreSQL support "accent insensitive" collations?

查看:31
本文介绍了PostgreSQL 是否支持“重音不敏感"?整理?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Microsoft SQL Server 中,可以指定不区分重音"的排序规则(对于数据库、表或列),这意味着可以进行像

这样的查询

SELECT * FROM users WHERE name LIKE 'João'

查找具有 Joao 名称的行.

我知道可以使用 unaccent_string contrib 函数从 PostgreSQL 中的字符串中去除重音符号,但我我想知道 PostgreSQL 是否支持这些不区分重音"的排序规则,以便上面的 SELECT 可以工作.

解决方案

使用 unaccent 模块 - 这与您要链接的内容完全不同.

<块引用>

unaccent 是一个文本搜索词典,它去除了重音符号(变音符号符号)来自词素.

每个数据库安装一次:

CREATE EXTENSION unaccent;

如果您收到如下错误:

<块引用>

错误:无法打开扩展控制文件/usr/share/postgresql//extension/unaccent.control":没有那个文件或目录

按照相关答案中的说明在您的数据库服务器上安装 contrib 包:

除其他外,它提供了可以在示例中使用的函数 unaccent()(其中似乎不需要 LIKE).

SELECT *来自用户WHERE unaccent(name) = unaccent('João');

索引

要为这种查询使用索引,请在表达式上创建一个索引.然而,Postgres 只接受IMMUTABLE 函数作为索引.如果函数可以为相同的输入返回不同的结果,则索引可能会悄悄中断.

unaccent()STABLE 而不是 IMMUTABLE

不幸的是,unaccent() 只是 STABLE,而不是 IMMUTABLE.根据 errelthis="pgsql-bugs 上的线程,这是由于三个原因:

  1. 这取决于字典的行为.
  2. 没有硬连线连接到这本词典.
  3. 因此,它还取决于当前的 search_path,它可以轻松更改.

网络上的

一些教程将函数易变性更改为 IMMUTABLE.这种蛮力方法在某些条件下可能会失效.

其他人建议使用简单的IMMUTABLE 包装函数(就像我过去做的那样).

是否使 变体一直存在争论带有两个参数 IMMUTABLE 明确声明使用的字典.阅读此处这里.

另一种选择是这个带有 IMMUTABLE unaccent() 函数的模块由 Musicbrainz 提供,在 Github 上提供.自己没测试过.我想我想出了一个更好的主意:

现在最好

这种方法比其他解决方案更有效,也更安全.
创建一个 IMMUTABLE SQL 包装函数,执行带有硬接线模式限定函数和字典的两参数形式.

由于嵌套非不可变函数会禁用函数内联,因此它基于 C 函数的副本,(假的)声明了 IMMUTABLE.它的唯一目的是在 SQL 函数包装器中使用.不能单独使用.

需要复杂性,因为无法在 C 函数的声明中硬连接字典.(需要修改 C 代码本身.)SQL 包装函数可以做到这一点,并允许函数内联 表达式索引.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)返回文本语言 c 不可变并行安全严格 AS'$libdir/unaccent', 'unaccent_dict';创建或替换函数 public.f_unaccent(text)返回文本语言 sql 不可变并行安全严格 AS$func$SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)$func$;

从 Postgres 9.5 或更早版本的两个函数中删除 PARALLEL SAFE.

public 是您安装扩展的架构(public 是默认设置).

显式类型声明 (regdictionary) 可防御恶意用户使用函数的重载变体进行的假设性攻击.

以前,我提倡基于与 unaccent 模块一起提供的 STABLE 函数 unaccent() 的包装函数.这禁用了函数内联.此版本的执行速度比我之前在此处使用的简单包装函数十倍.
这已经是将 SET search_path = public, pg_temp 添加到函数的第一个版本的两倍 - 直到我发现字典也可以是模式限定的.仍然(Postgres 12)从文档中不太明显.

如果您缺乏创建 C 函数所需的权限,那么您将回到第二个最佳实现:STABLEIMMUTABLE 函数包装器> unaccent() 模块提供的函数:

创建或替换函数 public.f_unaccent(text)返回文本为$func$SELECT public.unaccent('public.unaccent', $1) -- 模式限定函数和字典$func$ LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;

最后,表达式索引使查询快速:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

记住在对函数或字典进行任何更改后重新创建索引,例如不会重新创建索引的就地主要版本升级.最近的主要版本都对 unaccent 模块进行了更新.

调整查询以匹配索引(因此查询规划器将使用它):

SELECT * FROM usersWHERE f_unaccent(name) = f_unaccent('João');

您不需要正确表达式中的函数.在那里,您还可以直接提供像 'Joao' 这样的无重音字符串.

使用 表达式,更快的函数并没有转化为更快的查询索引.这对预先计算的值进行操作并且已经非常快了.但是索引维护和查询没有使用索引的好处.

Postgres 10.3/9.6.8 等加强了客户端程序的安全性.在任何索引中使用时,您需要对函数和字典名称进行模式限定.见:

连字

在 Postgres 9.5 或更早版本 中,像 'Œ' 或 'ß' 这样的连字必须手动扩展(如果你需要),因为 unaccent() 总是替换一个单个字母:

SELECT unaccent('Œ Æ œ æ ß');不重音----------东亚

你会喜欢此更新为 Postgres 9.6 中的 unaccent:

<块引用>

扩展 contrib/unaccent 的标准 unaccent.rules 文件来处理所有Unicode 已知的变音符号,并正确扩展连字(Thomas门罗,莱昂纳多·贝内代蒂)

粗体强调我的.现在我们得到:

SELECT unaccent('Œ Æ œ æ ß');不重音----------OE AE OE AE SS

模式匹配

对于具有任意模式的 LIKEILIKE,将其与模块 pg_trgm 在 PostgreSQL 9.1 或更高版本中.创建三元组 GIN(通常更可取)或 GIST 表达式索引.GIN 示例:

CREATE INDEX users_unaccent_name_trgm_idx ON users使用 gin (f_unaccent(name) gin_trgm_ops);

可用于以下查询:

SELECT * FROM usersWHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN 和 GIST 索引比普通 btree 维护成本更高:

对于左锚定模式有更简单的解决方案.有关模式匹配和性能的更多信息:

pg_trgm 还提供了有用的运算符对于相似性"(%)和距离"(<->).

Trigram 索引还支持带有 ~ 等的简单正则表达式.和不区分大小写ILIKE匹配的模式:

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

SELECT * FROM users WHERE name LIKE 'João'

to find a row with a Joao name.

I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

解决方案

Use the unaccent module for that - which is completely different from what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

If you get an error like:

ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory

Install the contrib package on your database server like instructed in this related answer:

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.

unaccent() only STABLE not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily.

Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:

Best for now

This approach is more efficient as other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired schema-qualified function and dictionary.

Since nesting a non-immutable function would disable function inlining, base it on a copy of the C-function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.

The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;

Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.

public being the schema where you installed the extension (public is the default).

The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.

Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.

If you lack the necessary privileges to create C functions, you are back to the second best implementation: An IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;

Finally, the expression index to make queries fast:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.

Adapt queries to match the index (so the query planner will use it):

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

You don't need the function in the right expression. There you can also supply unaccented strings like 'Joao' directly.

The faster function does not translate to much faster queries using the expression index. That operates on pre-computed values and is very fast already. But index maintenance and queries not using the index benefit.

Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:

Ligatures

In Postgres 9.5 or older ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S

You will love this update to unaccent in Postgres 9.6:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss

Pattern matching

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Can be used for queries like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GIST indexes are more expensive to maintain than plain btree:

There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).

Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:

这篇关于PostgreSQL 是否支持“重音不敏感"?整理?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆