Difference between composite characters and surrogate pairs

Question

In Unicode what is the difference between composite characters and surrogate pairs?

To me they sound like similar things - two characters to represent one character. What differentiates these two concepts?

Recommended answer

Surrogate pairs are a weird wart in Unicode.

Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is: capital letter A, Greek alternate terminal sigma, Klingon closing bracket 2, and so on each get a number. Currently, numbers up to about 2^21 are available (the actual upper limit is 0x10FFFF, which fits in 21 bits), though not all are in use. In the context of Unicode, each number is known as a code point.
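
As a small illustration of "meaning assigned to numbers", here is a sketch in Python (a language chosen for this example; the answer itself names none). It only uses the standard library, and the variable names are made up for illustration.

```python
import sys
import unicodedata

# A code point is just an integer; ord() and chr() convert between
# the integer and a one-code-point string.
code_point_of_a = ord("A")

# U+03C2 is the terminal (final) form of the Greek letter sigma.
final_sigma_name = unicodedata.name(chr(0x03C2))

# The highest code point Python (and Unicode) allows.
highest_code_point = sys.maxunicode

print(hex(code_point_of_a))      # 0x41
print(final_sigma_name)          # GREEK SMALL LETTER FINAL SIGMA
print(hex(highest_code_point))   # 0x10ffff
```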

However, the Unicode suite as a whole contains more than just this encoding. It also contains technologies to serialize code points. This is essentially just an exercise in serializing unsigned integers. Three subfamilies of technologies are specified: UTF-32, UTF-8, and UTF-16.

UTF-32 simply expresses every code point as a 32-bit unsigned integer. That's easy. Two variants exist, for big and little endian, respectively. Each 32-bit serialized integer is called a code unit of this format, and this is a fixed-width format (one code unit per code point).
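
The fixed width and the two byte orders can be seen with a short Python sketch. The sample code point U+1F600 is an arbitrary choice for illustration, not something taken from the answer.

```python
# Serialize one code point in both UTF-32 byte orders.
code_point = 0x1F600
text = chr(code_point)

utf32_be = text.encode("utf-32-be")  # big endian, no byte order mark
utf32_le = text.encode("utf-32-le")  # little endian, no byte order mark

print(utf32_be.hex())  # 0001f600
print(utf32_le.hex())  # 00f60100

# Either way it is one 4-byte code unit holding the code point itself.
print(int.from_bytes(utf32_be, "big") == code_point)     # True
print(int.from_bytes(utf32_le, "little") == code_point)  # True
```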

UTF-8 is a clever multi-byte format, in which code points take up anything from one to four 8-bit bytes. (The original design allowed sequences of up to six bytes, but since the code space was capped at 0x10FFFF, four is the maximum.) This format is very portable, since it has no byte-ordering issues, and it is pretty compact for English, near-English and computer code. The code unit of UTF-8 is one byte, and this is a variable-width format (1 to 4 code units per code point).
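
The variable width is easy to observe. In this Python sketch the helper name utf8_length and the four sample characters are illustrative choices. They take one, two, three and four bytes respectively.

```python
def utf8_length(ch: str) -> int:
    """Number of UTF-8 code units (bytes) needed for one code point."""
    return len(ch.encode("utf-8"))

samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, outside the 16-bit range
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```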

Finally, there's UTF-16. Initially, people thought Unicode could make do with only 2^16 numbers, so this was initially deemed to be a fixed-width format with 16-bit code units. However, it eventually became clear that more numbers were needed. So UTF-16 is now also a variable-width format, and the way this is achieved is that certain 16-bit code units act as indicators that they are part of a two-unit pair, the surrogate pair. However, to simplify the way you detect those pairs, rather than having some external envelope format as UTF-8 does, the actual 16-bit values used by the surrogates are deliberately leaked back into the Unicode encoding and reserved there. That is, the surrogate values, 0xD800 to 0xDFFF, are never assigned to characters and are not valid Unicode scalar values.
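
The pairing rule itself is simple arithmetic. The following Python sketch (the function name to_surrogate_pair is made up for illustration) splits a code point above U+FFFF into its two surrogate code units, then compares the result against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000     # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against the built-in encoder (big endian, no byte order mark).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```

Because every surrogate value falls in 0xD800 to 0xDFFF, a decoder can tell from a single 16-bit unit whether it is looking at half of a pair.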

So, in summary, surrogates are the result of forcing a serialization format for Unicode back into the encoding, and distorting the design of the encoding to accommodate the serialization format. This is perhaps an unfortunate historical accident, which is somewhat pointless and unsightly in retrospect, but it's what we have and what we need to live with.

Composite characters, on the other hand, are something much higher-level: They are visual units ("graphemes") that are composed of multiple Unicode code points. Sometimes people refer to code points themselves as "characters", but that's a little bit misleading, since characters should really be graphemes, and they can consist of several components (e.g. a base letter plus diacritics and modifiers).
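
To make the contrast concrete, here is a Python sketch using the standard unicodedata module. The sample strings are illustrative. An accented letter built from two code points is a composite character at the code point level, while a surrogate pair is one code point that merely needs two UTF-16 code units.

```python
import unicodedata

# Composite character: base letter plus combining mark, two code points.
decomposed = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, one code point

print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Surrogate pair: a single code point that takes two UTF-16 code units.
single = "\U0001F600"
utf16_units = len(single.encode("utf-16-le")) // 2
print(len(single), utf16_units)            # 1 2
```

Python strings count code points, which is why len() reports 2 for the composite character but 1 for the character that UTF-16 stores as a surrogate pair.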