PostgreSQL regex to match uppercase, Unicode-aware


Question

The title sums it up pretty well. I'm looking for a regular expression that matches Unicode uppercase characters with the Postgres ~ operator. The obvious approach doesn't work:

=> select 'A' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ó' ~ '[[:upper:]]';
 ?column? 
----------
 t
(1 row)

=> select 'Ą' ~ '[[:upper:]]';
 ?column? 
----------
 f
(1 row)

I'm using PostgreSQL 9.1 and my locale is set to pl_PL.UTF-8. The ordering works fine.

=> show LC_CTYPE;
  lc_ctype   
-------------
 pl_PL.UTF-8
(1 row)


Recommended answer

The regexp engine of PG 9.1 and older versions does not correctly classify characters whose codepoint doesn't fit in one byte. The codepoint of 'Ó' is 211, so the engine gets it right, but the codepoint of 'Ą' is 260, which is beyond 255.
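To make the 255 threshold concrete, here is a minimal Python sketch (not PostgreSQL code) that prints the codepoints of the characters from the question. The helper name `codepoint_fits_one_byte` is made up for illustration. Note that the check is on the codepoint value, not on the UTF-8 encoded length: 'Ó' takes two bytes in UTF-8 but its codepoint is still below 256.

```python
def codepoint_fits_one_byte(ch: str) -> bool:
    """Return True if the character's Unicode codepoint is at most 255."""
    return ord(ch) <= 0xFF


if __name__ == "__main__":
    # Characters taken from the question above.
    for ch in ("A", "Ó", "Ą"):
        print(
            ch,
            ord(ch),                       # codepoint as a decimal number
            len(ch.encode("utf-8")),       # number of bytes in UTF-8
            codepoint_fits_one_byte(ch),   # the threshold described above
        )
```

Only 'Ą' falls outside the range, which matches the one query in the question that returned f.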

PG 9.2 classifies such characters better, though it is still not 100% right for all alphabets. See this commit in the PostgreSQL source code, and particularly these parts of the comment:


remove the hard-wired limitation to not consider wctype.h results for character codes above 255


Still, we can push it up to U+7FF (which I chose as the limit of 2-byte UTF8 characters), which will at least make Eastern Europeans happy pending a better solution

Unfortunately, this change was not backported to 9.1.
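The U+7FF limit mentioned in the quoted comment is the highest codepoint that UTF-8 encodes in two bytes. The following Python sketch (again, not PostgreSQL code) checks that boundary. The helper names `utf8_len` and `within_u7ff_limit` are hypothetical, and the second one only mirrors the limit as the quoted comment describes it, not the engine's actual logic.

```python
def utf8_len(codepoint: int) -> int:
    """Number of bytes UTF-8 uses to encode the given codepoint."""
    return len(chr(codepoint).encode("utf-8"))


def within_u7ff_limit(ch: str) -> bool:
    """True if the character is at or below U+7FF (assumed limit from the quote)."""
    return ord(ch) <= 0x7FF


if __name__ == "__main__":
    # Boundaries between 1-, 2- and 3-byte UTF-8 sequences.
    for cp in (0x7F, 0x80, 0x7FF, 0x800):
        print(hex(cp), utf8_len(cp))

    # 'Ą' (U+0104) is inside the range; U+1E02 (Latin capital B with dot above)
    # is an example of an uppercase letter that lies outside it.
    for ch in ("Ą", "\u1e02"):
        print(hex(ord(ch)), within_u7ff_limit(ch))
```

This suggests why the answer calls 9.2 an improvement rather than a complete fix: letters above U+7FF would still fall outside the range described in the quote.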
