PostgreSQL 是否支持“不区分重音"?排序规则? [英] Does PostgreSQL support "accent insensitive" collations?

查看:17
本文介绍了PostgreSQL 是否支持“不区分重音"?排序规则?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Microsoft SQL Server 中,可以指定不区分重音"排序规则(用于数据库、表或列),这意味着可以进行类似的查询

SELECT * FROM users WHERE name LIKE 'João'

查找具有 Joao 名称的行.

我知道可以使用 unaccent_string contrib 函数从 PostgreSQL 中的字符串中去除重音符号,但我我想知道 PostgreSQL 是否支持这些不区分重音"排序规则,所以上面的 SELECT 可以工作.

解决方案

使用 unaccent 模块 - 这与您链接的内容完全不同.

<块引用>

unaccent 是一个文本搜索字典,可以去除重音符号(变音符号符号)来自词位.

每个数据库安装一次:

CREATE EXTENSION unaccent;

如果您收到如下错误:

<块引用>

错误:无法打开扩展控制文件/usr/share/postgresql//extension/unaccent.control":没有这样的文件或目录

按照相关答案中的说明在您的数据库服务器上安装 contrib 包:

除其他外,它还提供了您可以在示例中使用的函数 unaccent()(其中似乎不需要 LIKE).

选择 *来自用户WHERE unaccent(name) = unaccent('João');

索引

要对此类查询使用索引,请在表达式上创建 索引.然而,Postgres 只接受索引的IMMUTABLE 函数.如果函数可以为相同的输入返回不同的结果,则索引可能会静默中断.

unaccent()STABLE 而非 IMMUTABLE

不幸的是,unaccent() 只是 STABLE,而不是 IMMUTABLE.根据 这个pgsql-bugs上的线程,这是由于三个原因:

  1. 这取决于字典的行为.
  2. 这本词典没有硬连线连接.
  3. 因此它还取决于当前的 search_path,它可以轻松更改.

一些教程在网络上指示只是将函数可变性更改为 IMMUTABLE.这种暴力破解方法在某些情况下可能会崩溃.

其他人建议 简单的 IMMUTABLE 包装函数(就像我过去自己做的那样).

是否制作 变体一直存在争议有两个参数 IMMUTABLE 明确声明使用的字典.阅读这里IMMUTABLE unaccent() 函数的模块由 Musicbrainz 提供,在 Github 上提供.自己没有测试过.我想我想出了一个更好的主意:

目前最佳

这种方法比其他浮动解决方案更有效,也更安全.
创建一个 IMMUTABLE SQL 包装函数,执行带有硬连线模式限定函数和字典的双参数形式.

由于嵌套非不可变函数会禁用函数内联,因此它基于 C 函数的副本,(假)声明的 IMMUTABLE 也是如此.它的唯一目的是在SQL函数包装器中使用.不能单独使用.

需要复杂性,因为无法在 C 函数的声明中硬连线字典.(需要破解 C 代码本身.)SQL 包装函数执行此操作,并允许函数内联 表达式索引.

创建或替换函数 public.immutable_unaccent(regdictionary, text)返回文本 LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS'$libdir/unaccent', 'unaccent_dict';创建或替换函数 public.f_unaccent(text)返回文本语言 sql IMMUTABLE PARALLEL SAFE STRICT AS$功能$选择 public.immutable_unaccent(regdictionary 'public.unaccent', $1)$函数$;

从 Postgres 9.5 或更早版本的两个函数中删除 PARALLEL SAFE.

public 是您安装扩展的架构(public 是默认值).

显式类型声明 (regdictionary) 可防御恶意用户使用函数的重载变体进行的假设攻击.

之前,我提倡基于 unaccent 模块附带的 STABLE 函数 unaccent() 的包装函数.这禁用了 函数内联.这个版本的执行速度比我之前在这里的简单包装函数快十倍.
这已经比在函数中添加 SET search_path = public, pg_temp 的第一个版本快两倍——直到我发现字典也可以是模式限定的.仍然(Postgres 12)从文档中不太明显.

如果您缺乏创建 C 函数所需的权限,那么您将回到次优实现:IMMUTABLE 函数包装器围绕 STABLE unaccent() 模块提供的函数:

创建或替换函数 public.f_unaccent(text)返回文本为$功能$SELECT public.unaccent('public.unaccent', $1) -- 模式限定函数和字典$func$ 语言 sql 不可变并行安全严格;

最后,表达式索引使查询快速:

创建索引 users_unaccent_name_idx ON users(public.f_unaccent(name));

记住在对函数或字典进行任何更改后重新创建索引涉及此函数,例如不会重新创建索引的就地主要版本升级.最近的主要版本都有 unaccent 模块的更新.

调整查询以匹配索引(因此查询计划器将使用它):

SELECT * FROM usersWHERE f_unaccent(name) = f_unaccent('João');

您不需要正确表达式中的函数.在那里,您还可以直接提供像 'Joao' 这样的无重音字符串.

使用 表达式,更快的函数不会转化为更快的查询索引.这在预先计算的值上运行,并且已经非常快了.但是索引维护和查询不使用索引的好处.

Postgres 10.3/9.6.8 等版本加强了客户端程序的安全性.您需要对用于任何索引的函数和字典名称进行模式限定.见:

连字

在 Postgres 9.5 或更早的 中,像 'Œ' 或 'ß' 之类的连字必须手动扩展(如果需要),因为 unaccent() 总是替换为单个字母:

SELECT unaccent('Œ Æ œ æ ß');不重音----------E A E A S

你会喜欢 Postgres 9.6 中对 unaccent 的更新:

<块引用>

扩展 contrib/unaccent 的标准 unaccent.rules 文件以处理所有Unicode 已知的变音符号,并正确扩展连字(Thomas门罗,伦纳德·贝内代蒂)

我的大胆强调.现在我们得到:

SELECT unaccent('Œ Æ œ æ ß');不重音----------OE AE OE AE SS

模式匹配

对于 LIKEILIKE 具有任意模式,将其与模块 pg_trgm.创建三元组 GIN(通常更可取)或 GIST 表达式索引.GIN 示例:

创建索引 users_unaccent_name_trgm_idx ON users使用杜松子酒(f_unaccent(名称)gin_trgm_ops);

可用于以下查询:

SELECT * FROM usersWHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN 和 GIST 索引的维护成本比普通 btree 高:

对于左锚定模式有更简单的解决方案.有关模式匹配和性能的更多信息:

pg_trgm 还提供了有用的 运算符对于相似度"(%)和距离"(<->).

Trigram 索引还支持带有 ~ 等的简单正则表达式.和 不区分大小写 模式匹配 ILIKE:

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

SELECT * FROM users WHERE name LIKE 'João'

to find a row with a Joao name.

I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

解决方案

Use the unaccent module for that - which is completely different from what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

If you get an error like:

ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory

Install the contrib package on your database server like instructed in this related answer:

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.

unaccent() only STABLE not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily.

Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:

Best for now

This approach is more efficient as other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired schema-qualified function and dictionary.

Since nesting a non-immutable function would disable function inlining, base it on a copy of the C-function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.

The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;

Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.

public being the schema where you installed the extension (public is the default).

The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.

Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.

If you lack the necessary privileges to create C functions, you are back to the second best implementation: An IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;

Finally, the expression index to make queries fast:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.

Adapt queries to match the index (so the query planner will use it):

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

You don't need the function in the right expression. There you can also supply unaccented strings like 'Joao' directly.

The faster function does not translate to much faster queries using the expression index. That operates on pre-computed values and is very fast already. But index maintenance and queries not using the index benefit.

Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:

Ligatures

In Postgres 9.5 or older ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S

You will love this update to unaccent in Postgres 9.6:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss

Pattern matching

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Can be used for queries like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GIST indexes are more expensive to maintain than plain btree:

There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).

Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:

这篇关于PostgreSQL 是否支持“不区分重音"?排序规则?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆