Check for invalid UTF8

Question

I am converting from UTF-8 to the actual code point values in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way to check whether a byte sequence is not valid UTF-8 in C++?

Recommended answer

Follow the tables in the Unicode standard, chapter 3. (I used the Unicode 5.1.0 version of the chapter (p103); the same table is Table 3-7 on p94 of Unicode 6.0.0, on p95 of Unicode 6.3, and on p125 of Unicode 8.0.0.)

Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8. The valid sequences are documented; all others are invalid.
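As a quick first filter, the rule about bytes that can never appear can be sketched in C++ as below. The function names `is_forbidden_utf8_byte` and `find_forbidden_utf8_byte` are made up for this illustration. Passing this check does not make the input valid; it only rejects input that is certainly invalid.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (the name is an assumption, not from the answer):
// true if the byte can never occur anywhere in valid UTF-8.
constexpr bool is_forbidden_utf8_byte(unsigned char b) {
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

// Returns the index of the first forbidden byte, or len if there is none.
inline std::size_t find_forbidden_utf8_byte(const unsigned char* data,
                                            std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (is_forbidden_utf8_byte(data[i])) {
            return i;
        }
    }
    return len;
}
```

A buffer such as `41 C3 A9 FF` would be rejected at index 3, while `C3 A9` alone passes this filter and still needs the full sequence check.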

Code Points        First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F     00..7F
U+0080..U+07FF     C2..DF     80..BF
U+0800..U+0FFF     E0         A0..BF      80..BF
U+1000..U+CFFF     E1..EC     80..BF      80..BF
U+D000..U+D7FF     ED         80..9F      80..BF
U+E000..U+FFFF     EE..EF     80..BF      80..BF
U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
U+100000..U+10FFFF F4         80..8F      80..BF     80..BF

Note that the irregularities are in the second byte for certain ranges of values of the first byte. The third and fourth bytes, when needed, are consistent. Note that not every code point within the ranges identified as valid has been allocated (and some are explicitly 'non-characters'), so there is more validation needed still.
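The table translates fairly directly into a validator. The following is a minimal sketch, not a definitive implementation: `is_well_formed_utf8` and `in_range` are names invented here, and the function only checks byte sequences against the table. It does not check whether a code point is assigned or is a noncharacter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if lo <= b <= hi.
inline bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Returns true if data[0..len) contains only byte sequences listed in the table.
inline bool is_well_formed_utf8(const unsigned char* data, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b0 = data[i];
        if (b0 <= 0x7F) {            // single-byte sequence
            ++i;
            continue;
        }
        std::size_t trail = 0;       // number of bytes after the first
        unsigned char lo = 0x80;     // allowed range for the second byte;
        unsigned char hi = 0xBF;     // narrowed for E0, ED, F0 and F4
        if (in_range(b0, 0xC2, 0xDF))      { trail = 1; }
        else if (b0 == 0xE0)               { trail = 2; lo = 0xA0; }
        else if (in_range(b0, 0xE1, 0xEC)) { trail = 2; }
        else if (b0 == 0xED)               { trail = 2; hi = 0x9F; }
        else if (in_range(b0, 0xEE, 0xEF)) { trail = 2; }
        else if (b0 == 0xF0)               { trail = 3; lo = 0x90; }
        else if (in_range(b0, 0xF1, 0xF3)) { trail = 3; }
        else if (b0 == 0xF4)               { trail = 3; hi = 0x8F; }
        else { return false; }       // 80..BF, C0, C1, F5..FF cannot start a sequence

        if (len - i <= trail) return false;              // truncated sequence
        if (!in_range(data[i + 1], lo, hi)) return false;
        for (std::size_t k = 2; k <= trail; ++k) {
            if (!in_range(data[i + k], 0x80, 0xBF)) return false;
        }
        i += trail + 1;
    }
    return true;
}
```

Only the second byte needs a per-lead-byte range, which is why `lo` and `hi` are the only values that vary between branches; the third and fourth bytes are always checked against 80..BF.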

The code points U+D800..U+DBFF are for UTF-16 high surrogates and U+DC00..U+DFFF are for UTF-16 low surrogates; those cannot appear in valid UTF-8 (you encode values outside the BMP, the Basic Multilingual Plane, directly in UTF-8), which is why that range is marked invalid.
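To illustrate what "directly" means, here is a small encoder sketch. The name `encode_utf8` and its return convention (0 on failure) are assumptions made for this example. A code point above U+FFFF is written as one four-byte sequence, with no surrogate pair involved, and surrogate code points themselves are refused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical encoder: writes the UTF-8 form of cp into out (room for 4 bytes)
// and returns the number of bytes written, or 0 if cp is a surrogate
// (U+D800..U+DFFF) or lies above U+10FFFF.
inline std::size_t encode_utf8(std::uint32_t cp, unsigned char* out) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not encodable
    if (cp > 0x10FFFF) return 0;                 // beyond the Unicode range
    if (cp <= 0x7F) {
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 3;
    }
    // Supplementary planes: a single four-byte sequence.
    out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
    out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
    out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 4;
}
```

For instance, U+1F600 comes out as `F0 9F 98 80`, whereas UTF-16 would represent the same code point as the surrogate pair D83D DE00.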

Other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that's encoded by 00, and UTF-8 defines that the non-minimal encoding C0 80 is invalid. And the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
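The arithmetic behind these exclusions can be seen with a deliberately naive decoder. This is a sketch for illustration only: `naive_decode` is a hypothetical name, it performs no validation at all, and it assumes the caller passes a buffer of at least four readable bytes.

```cpp
#include <cassert>
#include <cstdint>

// Deliberately NOT a validator: trusts the lead byte for the length and just
// combines the payload bits. Assumes s points at 4 or more readable bytes.
inline std::uint32_t naive_decode(const unsigned char* s) {
    if (s[0] < 0x80) {
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {  // 110xxxxx 10xxxxxx
        return ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    }
    if ((s[0] & 0xF0) == 0xE0) {  // 1110xxxx 10xxxxxx 10xxxxxx
        return ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
    }
    // Everything else is treated as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    return ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
           ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
}
```

Fed `C0 80`, this yields 0, the same value as the one-byte encoding `00`, which is exactly the non-minimal case. Fed `ED A0 80`, it yields 0xD800, a surrogate, and fed `F4 90 80 80` it yields 0x110000, one past the maximum code point U+10FFFF.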
