How does UTF-16 achieve self-synchronization?


Question

I know that UTF-16 is a self-synchronizing encoding scheme. I also read the Wikipedia article below, but did not quite get it.

Self-synchronizing code

Could you please explain it to me with a UTF-16 example?

Recommended answer

In UTF-16, characters outside the BMP are represented by a surrogate pair in which the first code unit (CU) lies in the range 0xD800-0xDBFF and the second in the range 0xDC00-0xDFFF. Each of the two CUs carries 10 bits of the code point (after 0x10000 has been subtracted from it). A character in the BMP is encoded as itself.
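As a rough illustration of the encoding described above, here is a minimal Python sketch. The function name `encode_utf16` is made up for this example and is not part of any library; it returns code units as plain integers and ignores byte order entirely.

```python
def encode_utf16(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # BMP character: encoded as itself in a single code unit.
        return [code_point]
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 | (offset >> 10)       # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits -> 0xDC00-0xDFFF
    return [high, low]
```

For instance, U+1F600 becomes the pair 0xD83D, 0xDE00, while U+20AC stays a single unit 0x20AC.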

Now the synchronization is easy. Given the position of any arbitrary code unit:


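  • If the code unit is in the range 0xD800-0xDBFF, it is the first of two code units; just read the next one and decode. Voilà, we have a full character outside the BMP.

  • If the code unit is in the range 0xDC00-0xDFFF, it is the second of two code units; go back one unit to read the first part, or advance to the next unit to skip the current character.

  • If it is in neither range, it is a character in the BMP. We don't need to do anything more.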
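The three cases above can be sketched in Python roughly as follows. The helper names (`is_high_surrogate`, `is_low_surrogate`, `char_start`, `decode_at`) are hypothetical, the input is assumed to be a list of 16-bit code units as integers, and an unpaired surrogate is simply returned as-is rather than reported as an error.

```python
def is_high_surrogate(cu: int) -> bool:
    return 0xD800 <= cu <= 0xDBFF


def is_low_surrogate(cu: int) -> bool:
    return 0xDC00 <= cu <= 0xDFFF


def char_start(units, i):
    """Return the index at which the character containing units[i] starts."""
    # Second half of a pair: step back one unit to reach the first half.
    if is_low_surrogate(units[i]) and i > 0 and is_high_surrogate(units[i - 1]):
        return i - 1
    # High surrogate or BMP character: we are already at the start.
    return i


def decode_at(units, i):
    """Return (start_index, code_point) for the character containing units[i]."""
    start = char_start(units, i)
    cu = units[start]
    if (is_high_surrogate(cu) and start + 1 < len(units)
            and is_low_surrogate(units[start + 1])):
        # Combine the two 10-bit halves and add back the 0x10000 offset.
        cp = 0x10000 + ((cu - 0xD800) << 10) + (units[start + 1] - 0xDC00)
        return start, cp
    # BMP character (or an unpaired surrogate, passed through unchanged).
    return start, cu
```

Note that `char_start` only ever looks at the current unit and at most one neighbour, never at the beginning of the string.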

In UTF-16 the CU is the unit, i.e. the smallest element. We work at the CU level and read CUs one by one instead of byte by byte. Because of that, along with historical reasons, UTF-16 is only self-synchronizing at the CU level.

The point of self-synchronization is to know immediately whether we're in the middle of something, instead of having to read again from the start and check. UTF-16 allows us to do that.


Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, it is not possible for a surrogate to match a BMP character, or for (parts of) two adjacent characters to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units. UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).
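To make the byte-level caveat concrete, here is a small Python sketch, assuming big-endian UTF-16 without a BOM. The helper `code_units_be` is invented for this example. It shows that once a single byte is lost, the remaining bytes regroup into different code units that can still look like perfectly valid BMP characters, so nothing in the stream signals the misalignment.

```python
import struct


def code_units_be(data: bytes):
    """Split big-endian UTF-16 bytes into 16-bit code units.

    Hypothetical helper; a trailing odd byte is silently ignored.
    """
    n = len(data) // 2
    return list(struct.unpack(f">{n}H", data[: 2 * n]))


data = "A\u20acB".encode("utf-16-be")   # bytes: 00 41 20 AC 00 42

# Read from a code unit boundary: the intended text.
aligned = code_units_be(data)           # [0x0041, 0x20AC, 0x0042]

# Lose the first byte: every following unit is regrouped incorrectly.
shifted = code_units_be(data[1:])       # [0x4120, 0xAC00]
```

Both values in `shifted` fall outside the surrogate ranges, so a decoder would accept them as ordinary BMP characters and never recover the original boundaries.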

https://en.wikipedia.org/wiki/UTF-16#Description

Of course, that means UTF-16 may not be suitable for working over a medium without error correction/detection, such as a bare network environment. However, in a proper local environment it's a lot better than working without self-synchronization. For example, in DOS/V for Japanese, every time you press Backspace you must iterate from the start to know which character was deleted, because in the awful Shift-JIS encoding there's no way to know how long the character before the cursor is without a length map.
