PostgreSQL UTF-8 binary collation

Question

I would like a collation that orders the UTF-8 encoding of 0x1234 below that of 0x1235, regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently (http://msdn.microsoft.com/en-us/library/ms143350.aspx) has BIN and BIN2 collations. While finding these was easy, I can't even find a list of the collations PostgreSQL supports, much less an answer to this specific question.

Recommended answer

The C locale will do. UTF-8 is designed so that byte ordering is also code point ordering. This is not trivial, but consider how UTF-8 works:


Number range  Byte 1   Byte 2   Byte 3
0000-007F     0xxxxxxx
0080-07FF     110xxxxx 10xxxxxx
0800-FFFF     1110xxxx 10xxxxxx 10xxxxxx

When sorting binary data (which is what the C locale does), the first non-equal byte determines the ordering. What we need to see is that if two numbers encoded in UTF-8 differ, then the first non-equal byte is lower for the lower value. If the numbers are in different ranges, the first byte is indeed lower for the lower number, because the leading-byte prefixes 0, 110, and 1110 increase with the range. Within the same range, the order is determined by literally the same bits as without encoding.
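The argument above can be checked empirically. The following is a minimal Python sketch, not part of the original answer: it compares sorting by code point with sorting by raw UTF-8 bytes over a random sample of code points. The helper names `sample_code_points` and `utf8_order_matches_codepoint_order` are made up for this illustration, and the sample also covers the four-byte range above FFFF, which the table does not list.

```python
import random


def sample_code_points(count, seed=0):
    """Return `count` distinct random code points, skipping surrogates.

    Hypothetical helper for this illustration only.
    """
    rng = random.Random(seed)
    points = set()
    while len(points) < count:
        cp = rng.randrange(0x110000)
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogates are not valid scalar values and cannot be UTF-8 encoded.
            continue
        points.add(cp)
    return sorted(points)


def utf8_order_matches_codepoint_order(code_points):
    """True if byte-wise order of the UTF-8 encodings equals code point order."""
    chars = [chr(cp) for cp in code_points]
    by_code_point = sorted(chars, key=ord)
    # bytes objects compare byte by byte, like a binary ("C"-style) comparison.
    by_utf8_bytes = sorted(chars, key=lambda c: c.encode("utf-8"))
    return by_code_point == by_utf8_bytes


if __name__ == "__main__":
    # Include the range boundaries and the two values from the question.
    boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x1234, 0x1235]
    sample = sorted(set(sample_code_points(10_000) + boundaries))
    print(utf8_order_matches_codepoint_order(sample))
    print(chr(0x1234).encode("utf-8") < chr(0x1235).encode("utf-8"))
```

In PostgreSQL itself, the same byte-wise ordering can be requested per query with `ORDER BY some_column COLLATE "C"`, where `some_column` is a placeholder name.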
