PostgreSQL levenshtein and precomposed character vs. combined character


Question

I have strings containing two similar-looking characters. Both appear as a small 'a' with an ogonek:

ą

ą

(Note: depending on the renderer, they are sometimes rendered identically and sometimes slightly differently.)

But they are different.

Characteristics of the 1st character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
261

The UTF-8 encoding in hex is: \xC4\x85

So it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character).

Characteristics of the 2nd character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
97

(the same as for the plain character 'a')

That strongly indicates that the rendered character is combined out of two characters. And it is indeed:

The UTF-8 encoding in hex is: \x61\xCC\xA8

So it is a combination of

a \x61

and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:

̨ \xCC\xA8
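
The difference between the two forms can also be checked outside the database. The following is a minimal sketch in Python, using only the standard unicodedata module; the helper name describe is made up for illustration. It lists the code points, their Unicode names, and the UTF-8 bytes of each string.

```python
import unicodedata

precomposed = "\u0105"   # one code point: the precomposed letter
combined = "a\u0328"     # two code points: 'a' followed by the combining ogonek

def describe(s):
    """Return ([(code point, Unicode name), ...], UTF-8 bytes) for a string."""
    points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]
    return points, s.encode("utf-8")

for label, s in (("precomposed", precomposed), ("combined", combined)):
    points, raw = describe(s)
    # Print the bytes as space-separated hex pairs, e.g. "c4 85"
    print(label, points, " ".join(f"{b:02x}" for b in raw))
```

The first code point of each string is what ascii() reports in the queries above: 261 for the precomposed letter and 97 for the combined sequence.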

I want to use PostgreSQL's levenshtein function to determine the similarity of words, so I want to treat both characters as the same (which is of course what people intend when they write the name of a distinct entity with either the 1st or the 2nd character).

I assumed that I could use unaccent to always get rid of the ogonek, but that does not work in the 2nd case:

1st character, without unaccent: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       1

1st character, with unaccent: expected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       1

2nd character, without unaccent: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       2

2nd character, with unaccent: unexpected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       2

So, when I compare both characters using levenshtein and unaccent, the result is 1:

select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein 
-------------
       1

instead of 0.
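
The distances above follow from counting edits per code point. Here is a minimal sketch in Python that reproduces them with a textbook Levenshtein implementation. The toy_unaccent helper and its single rule are hypothetical stand-ins for an unaccent configuration that maps the precomposed letter but has no rule for the bare combining ogonek; they are not the real unaccent rules file.

```python
def levenshtein(a, b):
    """Edit distance counted in code points; insert, delete, substitute cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# Hypothetical rule set: only the precomposed letter is mapped to 'a'.
TOY_RULES = {"\u0105": "a"}

def toy_unaccent(s):
    """Apply the toy rules code point by code point."""
    return "".join(TOY_RULES.get(ch, ch) for ch in s)

precomposed = "\u0105"
combined = "a\u0328"

print(levenshtein(precomposed, "x"))                # one substitution
print(levenshtein(toy_unaccent(combined), "x"))     # ogonek survives: substitution + deletion
print(levenshtein(toy_unaccent(precomposed), toy_unaccent(combined)))  # leftover ogonek
```

Under these assumptions the combining ogonek survives as a code point of its own, which accounts for the extra unit of distance in the 2nd case.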

How can I "get rid of the ogonek" in the 2nd case?

(How) can I use the UTF-8 codes of the strings to get the desired result?

Edit: As @s-man suggested, adding the combining character to unaccent.rules would solve this particular problem. But to solve the precomposed character vs. combined character problem with unaccent in general, I would have to explicitly add or modify every missing or "misconfigured" combined character in the config.

Recommended answer

Removing accents will give you a Levenshtein distance of 0, but it will also give you a distance of 0 between ą and a, which does not sound ideal.

The better solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence E'a\u0328' into the precomposed character E'\u0105' before comparing them.
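
To illustrate the difference between the two approaches, here is a minimal sketch in plain Python using the standard unicodedata module. The helper names nfc and strip_marks are made up for this example, and strip_marks is only one possible way to remove accents, not what unaccent does internally.

```python
import unicodedata

def nfc(s):
    """Compose base letter + combining mark into the precomposed form where one exists."""
    return unicodedata.normalize("NFC", s)

def strip_marks(s):
    """Remove accents: decompose (NFD), then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

precomposed = "\u0105"
combined = "a\u0328"

# Normalisation makes the two spellings identical ...
print(nfc(combined) == nfc(precomposed))
# ... while keeping the accented letter distinct from a plain 'a'.
print(nfc(precomposed) == nfc("a"))
# Stripping accents also unifies the two spellings, but collapses them onto 'a'.
print(strip_marks(combined) == strip_marks(precomposed) == "a")
```

Normalising to NFC keeps the information that the letter carries an ogonek, so a comparison can still tell ą apart from a.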

Unfortunately, Postgres did not have a built-in Unicode normalisation function when this answer was written (PostgreSQL 13 and later provide normalize(), usable when the server encoding is UTF8), but you can easily access one via the PL/Perl or PL/Python language extensions.

For example:

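create extension plpython3u;

create or replace function unicode_normalize(str text) returns text as $$
  # PL/Python 3 passes text arguments as Unicode strings, so no decoding is needed.
  # Under the legacy Python 2 language plpythonu, call str.decode('UTF-8') first.
  import unicodedata
  return unicodedata.normalize('NFC', str)
$$ language plpython3u;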

Then:

test=# select levenshtein(unicode_normalize(E'a\u0328'), unicode_normalize(E'\u0105'));
 levenshtein
-------------
           0

This also solves the issue in your previous question, where the combining character was contributing to the Levenshtein distance:

test=# select levenshtein(unicode_normalize(E'a\u0328'), 'x');
 levenshtein
-------------
           1
