How does a file with Chinese characters know how many bytes to use per character?


Problem description

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't understand all the details. An example will illustrate my issues. Look at this file below:

[Image: http://www.yart.com.au/stackoverflow/unicode2.png]

I have opened the file in a binary editor to closely examine the last of the three a's next to the first Chinese character:

[Image: http://www.yart.com.au/stackoverflow/unicode1.png]

According to Joel:

"In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes."

So does the editor say:

  1. E6 (230) is above code point 128.
  2. Thus I will interpret the following bytes as either 2, 3, in fact, up to 6 bytes.

If so, what indicates that the interpretation is more than 2 bytes? How is this indicated by the bytes that follow E6?

Is my Chinese character stored in 2, 3, 4, 5 or 6 bytes?

Solution

If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
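
The bit patterns in the 1st Byte column are what tell a reader how long each character is: the high bits of the first byte give the length, and every following byte starts with 10. A minimal Python sketch of that rule follows. The function name `utf8_sequence_length` is made up for illustration, and the function checks only the bit pattern, not whether the byte is otherwise allowed:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with lead_byte occupies.

    Only the high-bit pattern is inspected; this does not reject lead bytes
    that are otherwise disallowed in well-formed UTF-8.
    """
    if lead_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:   # 110yyyyy
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:   # 1110zzzz
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:   # 11110uuu
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx matches no row of the table.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


print(utf8_sequence_length(0x61))  # 1
print(utf8_sequence_length(0xE6))  # 3
```

For the byte in the question, 0xE6 is 11100110 in binary, which matches 1110zzzz, so the character occupies three bytes.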

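Some byte values are not allowed. In particular, bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in well-formed UTF-8: 0xC0 and 0xC1 could only start overlong two-byte encodings of U+0000..U+007F, and 0xF5 and above would encode values beyond U+10FFFF or are not used at all. A number of other combinations are also forbidden; those irregularities are in the 1st byte and 2nd byte columns of the table below. Note that the codes U+D800..U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8. Because Unicode stops at U+10FFFF, well-formed UTF-8 never needs more than four bytes per character; the five- and six-byte forms of older descriptions are not valid.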

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

These tables are lifted from the Unicode standard version 5.1.
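
The range table can be turned almost directly into a well-formedness check. The following Python sketch is one way to do that; the names `WELL_FORMED` and `is_well_formed_utf8` are hypothetical, and the code is an illustration of the table rather than a replacement for a library decoder:

```python
# Each row: (first-byte range, second-byte range or None, sequence length),
# transcribed from the code point range table.
WELL_FORMED = [
    ((0x00, 0x7F), None, 1),
    ((0xC2, 0xDF), (0x80, 0xBF), 2),
    ((0xE0, 0xE0), (0xA0, 0xBF), 3),
    ((0xE1, 0xEC), (0x80, 0xBF), 3),
    ((0xED, 0xED), (0x80, 0x9F), 3),
    ((0xEE, 0xEF), (0x80, 0xBF), 3),
    ((0xF0, 0xF0), (0x90, 0xBF), 4),
    ((0xF1, 0xF3), (0x80, 0xBF), 4),
    ((0xF4, 0xF4), (0x80, 0x8F), 4),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences allowed by the table."""
    i = 0
    while i < len(data):
        for (lo, hi), second, length in WELL_FORMED:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # first byte matches no row of the table
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # sequence is cut off at the end of the data
        if length > 1:
            if not (second[0] <= seq[1] <= second[1]):
                return False  # second byte outside the row's restricted range
            if any(not (0x80 <= b <= 0xBF) for b in seq[2:]):
                return False  # later bytes must be plain continuation bytes
        i += length
    return True


print(is_well_formed_utf8(bytes([0x61, 0xE6, 0xBE, 0xB3])))  # True
print(is_well_formed_utf8(bytes([0xED, 0xA0, 0x80])))        # False (surrogate)
```

The restricted second-byte ranges are what exclude overlong forms, the surrogates U+D800..U+DFFF, and anything above U+10FFFF.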


In the question, the material from offset 0x0010 .. 0x008F yields:

0x61           = U+0061
0x61           = U+0061
0x61           = U+0061
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE6 0x96 0xB0 = U+65B0
0xE9 0x97 0xBB = U+95FB
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE7 0xBD 0x91 = U+7F51
0xE7 0xAB 0x99 = U+7AD9
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE6 0x9C 0x80 = U+6700
0xE5 0xA4 0xA7 = U+5927
0xE7 0x9A 0x84 = U+7684
0xE5 0x8D 0x8E = U+534E
0x2D           = U+002D
0x29           = U+0029
0xE5 0xA5 0xA5 = U+5965
0xE5 0xB0 0xBA = U+5C3A
0xE7 0xBD 0x91 = U+7F51
0x26           = U+0026
0x6C           = U+006C
0x74           = U+0074
0x3B           = U+003B
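
Any line of this listing can be reproduced by hand: strip the marker bits from each byte and concatenate what remains. A small Python sketch of that arithmetic follows. The helper name `decode_utf8_sequence` is invented for illustration, and it assumes the input is a single well-formed sequence, so it does no validation:

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Decode one UTF-8 sequence to its code point (no validation performed)."""
    lead = seq[0]
    if lead < 0x80:                 # 0xxxxxxx: the byte is the code point
        return lead
    if lead & 0xE0 == 0xC0:         # 110yyyyy: keep the low 5 bits
        code_point = lead & 0x1F
    elif lead & 0xF0 == 0xE0:       # 1110zzzz: keep the low 4 bits
        code_point = lead & 0x0F
    elif lead & 0xF8 == 0xF0:       # 11110uuu: keep the low 3 bits
        code_point = lead & 0x07
    else:
        raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")
    for byte in seq[1:]:
        # Each continuation byte 10xxxxxx contributes its low 6 bits.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


first_chinese_char = bytes([0xE6, 0xBE, 0xB3])
print(f"U+{decode_utf8_sequence(first_chinese_char):04X}")  # U+6FB3
print(first_chinese_char.decode("utf-8"))                   # 澳
```

For 0xE6 0xBE 0xB3 the payload bits are 0110, 111110 and 110011, which join to 0110 1111 1011 0011, that is U+6FB3. So the first Chinese character in the question is stored in three bytes, as are the other Chinese characters in the listing.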
