Unicode normalization in Postgres

Views: 107 | Published: 2020/5/29 20:43:28 | Tags: postgresql unicode plpython

Problem description

I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaereses) which I need to update to their Unicode normalized form, e.g., the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
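To make the two spellings concrete, here is a small Python sketch (Python 3 standard library only; `describe` and the variable names are illustrative, not part of the original question). It lists the code points and UTF-8 bytes of each form of á and composes the decomposed one with NFC.

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point (U+00E1)
decomposed = "a\u0301"   # a (U+0061) followed by COMBINING ACUTE ACCENT (U+0301)

def describe(s):
    """Return (code points, UTF-8 bytes) of s as space-separated hex strings."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in s)
    utf8_bytes = " ".join(f"{b:02x}" for b in s.encode("utf-8"))
    return code_points, utf8_bytes

print(describe(precomposed))
print(describe(decomposed))

# NFC composes the base letter and the combining accent into one code point.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)
```

The two strings render identically but differ in both length and byte content, which is why the stored names need to be rewritten rather than just re-displayed.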

I found a solution on an old Postgres Nabble mailing-list thread from 2009, using PL/Python:

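create or replace function unicode_normalize(str text) returns text as $$
  # plpythonu (Python 2) passes text as a byte string in the server encoding,
  # assumed here to be UTF-8, so decode it before normalizing to NFC.
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;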

This works as expected, but it made me wonder whether there is any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.

EDIT: As Craig has pointed out, this is one of the things I tried:

SELECT convert_to(E'\u00E1', 'iso-8859-1');

returns \xe1, whereas

SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');

fails with ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
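The same asymmetry can be reproduced outside Postgres. The following is a rough Python analogue, not Postgres code: `to_latin1` is a hypothetical helper, and Python's codec stands in for convert_to. The precomposed form encodes to a single Latin-1 byte, the decomposed form fails on the combining accent, and normalizing to NFC first makes it succeed.

```python
import unicodedata

def to_latin1(text, normalize_first=False):
    """Encode text as ISO-8859-1, optionally composing it to NFC beforehand."""
    if normalize_first:
        text = unicodedata.normalize("NFC", text)
    return text.encode("iso-8859-1")

print(to_latin1("\u00e1"))            # precomposed á encodes directly

try:
    to_latin1("a\u0301")              # U+0301 has no Latin-1 equivalent
except UnicodeEncodeError as exc:
    print("failed:", exc.reason)

print(to_latin1("a\u0301", normalize_first=True))
```

The byte sequence 0xcc81 in the Postgres error is the UTF-8 encoding of U+0301, the combining acute accent that the target encoding cannot represent on its own.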

Recommended answer

I think this is a Pg bug.

In my opinion, PostgreSQL should be normalizing UTF-8 into precomposed form before performing encoding conversions. The results of the conversions shown are wrong.

I'll raise it on pgsql-bugs ... done.

http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com

You should be able to follow the thread there.

Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
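As one possible reading of "normalise at your input boundaries", here is a minimal Python sketch. It assumes incoming data arrives as nested dicts, lists and strings (for example, parsed form or JSON input); `normalize_input` is a hypothetical helper name, and NFC is chosen to match the precomposed form asked for in the question.

```python
import unicodedata

def normalize_input(value):
    """Recursively NFC-normalize every string value in incoming data.

    Dict keys and non-string values are passed through unchanged.
    """
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {key: normalize_input(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_input(item) for item in value]
    if isinstance(value, tuple):
        return tuple(normalize_input(item) for item in value)
    return value

# Example: a decomposed place name is composed before it reaches the database.
row = {"id": 1, "name": "Sga\u0300il", "aliases": ["a\u0301"]}
clean = normalize_input(row)
print(clean)
```

Applying this once, at the point where data enters the application, means every string stored afterwards is already in a single form, so comparisons and encoding conversions in the database behave consistently.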

BTW, this can be simplified down to:

regress=> SELECT 'á' = 'á';
 ?column? 
----------
 f
(1 row)

which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal doesn't normalize UTF-8.)
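The same comparison can be written with explicit escapes, which sidesteps any normalization done by a browser or terminal during copy and paste. This is a Python sketch rather than SQL, and `nfc_equal` is a hypothetical helper: plain equality compares code points, so the two spellings differ until both sides are normalized.

```python
import unicodedata

def nfc_equal(left, right):
    """Compare two strings after composing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

precomposed = "\u00e1"   # single code point
decomposed = "a\u0301"   # base letter plus combining accent

print(precomposed == decomposed)           # code-point comparison
print(nfc_equal(precomposed, decomposed))  # comparison after normalization
```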

If you're using Firefox you might not see the query output above correctly; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode correctly:
