Are 6 octet UTF-8 sequences valid?

Views: 107  Posted: 2020/7/13 3:18:24  Tags: unicode utf-8

Problem description

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629.)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
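The bit layout in that quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function name `encode_utf8` is hypothetical, and the surrogate check comes from the same section of RFC 3629 rather than from the passage quoted above.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point with the 1 to 4 octet layout described in RFC 3629."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption taken from RFC 3629 section 3: surrogates are not encoded.
        raise ValueError("surrogate code points are not encodable")

    # A sequence of one: high-order bit 0, remaining 7 bits hold the value.
    if code_point < 0x80:
        return bytes([code_point])

    # Pick the sequence length n from the size of the value.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Initial octet: n high-order bits set to 1, then a 0 bit,
    # then the most significant payload bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF  # 0xC0, 0xE0 or 0xF0
    octets = [lead_marker | (code_point >> (6 * (n - 1)))]

    # Following octets: bits 10xxxxxx, 6 payload bits each.
    for i in range(n - 2, -1, -1):
        octets.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
    return bytes(octets)


# U+1F600 lies outside the BMP and still fits in 4 octets.
print(encode_utf8(0x1F600).hex(" "))
```

Comparing the result against Python's built-in `str.encode("utf-8")` is a quick way to sanity-check the sketch.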

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.
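One way to see what that sentence rules out is to feed those octets to a strict decoder. As far as I understand the bit patterns, C0 and C1 could only begin overlong two-octet forms of values below U+0080, F5 to F7 would begin four-octet sequences above U+10FFFF, F8 to FB and FC to FD were the 5- and 6-octet lead bytes of the older definition, and FE and FF were never used. The sketch below assumes Python 3's built-in "utf-8" codec as the strict decoder; the helper names `decodes` and `any_sequence_accepted` are hypothetical.

```python
# Octet values that RFC 3629 says never appear in UTF-8.
FORBIDDEN = [0xC0, 0xC1, *range(0xF5, 0x100)]


def decodes(seq: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def any_sequence_accepted(lead: int) -> bool:
    """Try lead followed by 0 to 5 continuation octets (0x80 or 0xBF)."""
    return any(
        decodes(bytes([lead]) + bytes([cont]) * k)
        for cont in (0x80, 0xBF)
        for k in range(6)
    )


# No forbidden octet starts a sequence the decoder accepts.
print(any(any_sequence_accepted(b) for b in FORBIDDEN))

# For contrast: the largest allowed code point, U+10FFFF, needs only 4 octets.
print(decodes(b"\xf4\x8f\xbf\xbf"))
```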

This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren't within the above range)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

Looking at the previous RFC confirms this...they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
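The buffer-sizing concern in that quote can be made concrete with a small sketch. It assumes the length boundaries of the older 31-bit scheme (1 to 6 octets, up to U+7FFFFFFF); the names `LEGACY_LIMITS`, `legacy_sequence_length` and `worst_case_buffer` are hypothetical and exist only for illustration.

```python
# Upper code point bound for each sequence length in the older 31-bit scheme.
LEGACY_LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def legacy_sequence_length(code_point: int) -> int:
    """Number of octets the older scheme would use for code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for upper, length in LEGACY_LIMITS:
        if code_point <= upper:
            return length
    raise ValueError("code point beyond U+7FFFFFFF")


def worst_case_buffer(num_chars: int, max_code_point: int = 0x10FFFF) -> int:
    """Octets needed for num_chars characters in the worst case."""
    return num_chars * legacy_sequence_length(max_code_point)


# Limiting input to U+10FFFF caps every character at 4 octets ...
print(worst_case_buffer(100))
# ... while accepting values up to U+7FFFFFFF needs room for 6 each.
print(worst_case_buffer(100, 0x7FFFFFFF))
```

A buffer sized with the 4-octet assumption is too small if an encoder is ever handed a value above U+10FFFF, which seems to be the overflow risk the RFC describes.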

So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?

Thanks.
