Are 6-octet UTF-8 sequences valid?

Question

Can UTF-8 use 5- or 6-byte sequences, so that every Unicode character can be encoded? The standards I have found conflict with each other. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629.)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
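The bit layout quoted above can be sketched as a small encoder. This is only an illustration under the RFC 3629 limits; the function name `utf8_encode` is made up for this example, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1-to-4-octet layout described in RFC 3629.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        # Single octet: high bit 0, 7 payload bits.
        return bytes([cp])
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4
    # Continuation octets: 10xxxxxx, 6 payload bits each, collected low bits first.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Lead octet: n one-bits, then a zero bit, then the remaining payload bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | cp
    return bytes([lead] + tail[::-1])


print(utf8_encode(0x20AC).hex())   # euro sign, inside the BMP, 3 octets
print(utf8_encode(0x1F600).hex())  # an emoji outside the BMP, 4 octets
```

Note that a code point outside the BMP such as U+1F600 still fits in 4 octets, so the 4-octet limit does not restrict encoding to the BMP.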

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from planes other than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.
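One way to check this statement is to encode every code point in U+0000..U+10FFFF (skipping surrogates) and see which octet values never show up. The sketch below uses Python's built-in UTF-8 codec; `octets_used` is a name chosen for this example. The sweep covers about 1.1 million code points, so it may take a second or so.

```python
def octets_used() -> set:
    """Collect every octet value that appears in the UTF-8 encoding
    of some code point in U+0000..U+10FFFF (surrogates excluded)."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used


never = set(range(256)) - octets_used()
print(sorted(f"{b:02X}" for b in never))
```

C0 and C1 could only begin overlong 2-octet forms of U+0000..U+007F, F5 to F7 would begin 4-octet sequences above U+10FFFF, and F8 to FD are the lead octets of the older 5- and 6-octet forms.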

Does this mean we cannot encode values in UTF-8 with 5 or 6 octets (or even with 4 octets, for some values that aren't within the above range)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

Looking at the previous RFC (RFC 2279) confirms this: the range of characters was reduced.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
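The buffer-sizing concern in this quote can be made concrete with a rough sketch. The thresholds below are the octet-count boundaries of the older scheme that went up to U+7FFFFFFF (as in RFC 2279); the function names `legacy_utf8_length` and `worst_case_buffer` are hypothetical and exist only for this example.

```python
def legacy_utf8_length(cp: int) -> int:
    """Octet count under the older 1-to-6-octet scheme (up to U+7FFFFFFF).

    Hypothetical helper; RFC 3629 only permits results of 1 to 4.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside 0..0x7FFFFFFF")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 octets.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for n, limit in enumerate(limits, start=1):
        if cp < limit:
            return n


def worst_case_buffer(num_chars: int, max_cp: int = 0x10FFFF) -> int:
    """Worst-case octet count for num_chars characters up to max_cp."""
    return num_chars * legacy_utf8_length(max_cp)


print(worst_case_buffer(10))              # sized for the RFC 3629 range
print(worst_case_buffer(10, 0x7FFFFFFF))  # sized for the older 6-octet range
```

A buffer sized with the first call overflows if the encoder is later fed character numbers above U+10FFFF without a range check, which is the risk the RFC describes.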

So these sequences are allowed by the ISO/IEC 10646 definition, but not by the RFC 3629 definition? Which one should I follow?

Thanks.

Recommended answer

There are no Unicode characters beyond U+10FFFF, and the BMP covers only U+0000 through U+FFFF. Characters in the other planes (U+10000 through U+10FFFF) are encoded with 4-octet sequences, so they are not excluded.

UTF-8 is well-defined for U+0000 through U+10FFFF, so 5- and 6-octet sequences are never needed to represent a Unicode character.
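As a quick check, Python's built-in UTF-8 decoder follows the RFC 3629 limits, so it can be used to see what a conforming implementation accepts. The helper name `is_valid_utf8` is made up for this sketch.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("\U0001F600".encode("utf-8")))  # True: 4 octets, outside the BMP
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))           # True: U+10FFFF, the last code point
print(is_valid_utf8(b"\xf4\x90\x80\x80"))           # False: would be U+110000
print(is_valid_utf8(b"\xfd\xbf\xbf\xbf\xbf\xbf"))   # False: old 6-octet form of U+7FFFFFFF
```

Python's `chr(0x110000)` likewise raises `ValueError`, since there is no such code point to represent.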
