How is this octet stream being interpreted as Hebrew UTF-8 encoding?


Question

The following byte stream is identified as UTF-8. It contains the Hebrew sentence: דירות לשותפים בתל אביב - הומלס. I'm trying to understand the encoding.

ubuntu@ip-10-126-21-104:~$ od -t x1 homeless-title-fromwireshark_followed_by_hexdump.txt
0000000 0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7
0000020 a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7
0000040 aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7
0000060 94 d7 95 d7 9e d7 9c d7 a1 0a
0000072
ubuntu@ip-10-126-21-104:~$ file -i homeless-title-fromwireshark_followed_by_hexdump.txt
homeless-title-fromwireshark_followed_by_hexdump.txt: text/plain; charset=utf-8

The file is UTF-8. I verified this by opening Notepad (Windows 7), typing the Hebrew character ד, and saving the file. The result is the following:

ubuntu@ip-10-126-21-104:~$ od -t x1 test_from_notepad_utf8_daled.txt
0000000 ef bb bf d7 93
0000005
ubuntu@ip-10-126-21-104:~$ file -i test_from_notepad_utf8_daled.txt
test_from_notepad_utf8_daled.txt: text/plain; charset=utf-8

Here ef bb bf is the BOM encoded in UTF-8, and d7 93 is exactly the sequence of bytes that appears in the original stream after 0a 09 (newline and tab in ASCII).

The problem is that, according to the Unicode code charts, ד should be coded as 05 D3. So why and how did the UTF-8 encoding come to be d7 93?

d7 93 in binary is 11010111 10010011, while
05 D3 in binary is 00000101 11010011

I can't find a transformation that makes sense of these two encodings, which (to my understanding) represent the same Unicode entity, "HEBREW LETTER DALET".

Thank you,
Maxim.

Recommended answer

Unicode code points U+0000..U+007F are encoded in UTF-8 as a single byte 0x00..0x7F.

Unicode code points U+0080..U+07FF (including HEBREW LETTER DALET, U+05D3) are encoded in UTF-8 as two bytes. Such a code point fits in 11 bits, which are split into a group of 5 bits and a group of 6 bits, as in xxxxxyyyyyy. The first byte of the UTF-8 representation has the bit pattern 110xxxxx; the second has the bit pattern 10yyyyyy.

0x05D3 = 0000 0101 1101 0011 

The last 6 bits of 0x05D3 are 010011; prefixed by 10, that gives 1001 0011, or 0x93. The preceding 5 bits are 10111; prefixed by 110, that gives 1101 0111, or 0xD7.

Hence, the UTF-8 encoding for U+05D3 is 0xD7 0x93.
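The bit manipulation above can be checked with a short Python sketch. The function names encode_two_byte and decode_two_byte are made up for this illustration (they are not part of any library), and the code only covers the two-byte range U+0080..U+07FF discussed so far.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form only covers U+0080..U+07FF")
    high5 = code_point >> 6        # the xxxxx group (upper 5 of 11 bits)
    low6 = code_point & 0b111111   # the yyyyyy group (lower 6 bits)
    # Prefix the groups with 110 and 10 respectively.
    return bytes([0b11000000 | high5, 0b10000000 | low6])


def decode_two_byte(data: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    first, second = data
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a 110xxxxx 10yyyyyy sequence")
    code_point = ((first & 0b11111) << 6) | (second & 0b111111)
    if code_point < 0x80:
        # Lead bytes 0xC0 and 0xC1 would re-encode a one-byte value.
        raise ValueError("overlong encoding")
    return code_point


dalet = encode_two_byte(0x05D3)
print(dalet.hex(" "))                # d7 93
print(hex(decode_two_byte(dalet)))   # 0x5d3
```

So 05 D3 is the code point itself, while d7 93 is that same code point after its 11 significant bits are packed into the two-byte UTF-8 layout.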

There are more rules for Unicode code points from U+0800 upwards, which require 3 or 4 (but not more) bytes for the UTF-8 representation. The continuation bytes always have the 10yyyyyy bit pattern. The first byte has the bit pattern 1110xxxx (3-byte values) or 11110xxx (4-byte values). A number of byte values cannot appear in valid UTF-8 at all; they are 0xC0, 0xC1, and 0xF5..0xFF.
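As a rough illustration of these rules, here is a minimal Python encoder covering all four lengths, plus a check of the byte stream from the question against the never-valid byte values. The names utf8_encode, NEVER_VALID, and stream are assumptions made for this sketch; in real code, Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if code_point <= 0x7F:
        return bytes([code_point])          # 0xxxxxxx
    if code_point <= 0x7FF:
        lead, extra = 0b11000000, 1         # 110xxxxx + 1 continuation
    elif code_point <= 0xFFFF:
        lead, extra = 0b11100000, 2         # 1110xxxx + 2 continuations
    else:
        lead, extra = 0b11110000, 3         # 11110xxx + 3 continuations
    tail = []
    for _ in range(extra):
        # Peel off 6 bits at a time, lowest group first.
        tail.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([lead | code_point] + tail[::-1])


# Byte values that can never occur in valid UTF-8.
NEVER_VALID = {0xC0, 0xC1} | set(range(0xF5, 0x100))

# The 58 bytes from the od dump in the question.
stream = bytes.fromhex(
    "0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7"
    "a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7"
    "aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7"
    "94 d7 95 d7 9e d7 9c d7 a1 0a"
)

print(utf8_encode(0x05D3).hex(" "))    # d7 93
print(utf8_encode(0x20AC).hex(" "))    # e2 82 ac
print(utf8_encode(0x1F600).hex(" "))   # f0 9f 98 80
print(NEVER_VALID.isdisjoint(stream))  # True
print(stream.decode("utf-8"))
```

Every Hebrew letter in the dump starts with the lead byte d7 because the letters used here all lie in U+05D0..U+05EA, so they share the same upper five bits (10111).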
