Manually converting Unicode codepoints into UTF-8 and UTF-16


Question

I have a university programming exam coming up, and one section is on Unicode.

I have checked all over for answers to this, and my lecturer is useless, so asking you guys is my last resort.

The question will be something like:

The string 'mЖ丽' has the Unicode codepoints U+006D, U+0416 and U+4E3D. With answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16.

Any help at all will be greatly appreciated, as I am trying to get my head round this.

Recommended answer

Wow. On the one hand, I'm thrilled to know that university courses are teaching the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:

Encoding
   The following byte sequences are used to represent a
   character.  The sequence to be used depends on the UCS code
   number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

   [... removed obsolete five and six byte forms ...]

   The xxx bit positions are filled with the bits of the
   character code number in binary representation.  Only the
   shortest possible multibyte sequence which can represent the
   code number of the character can be used.

   The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
   as 0xfffe and 0xffff (UCS noncharacters) should not appear in
   conforming UTF-8 streams.
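
As a rough sketch of how the chart translates into code, the Python function below picks the byte pattern from the codepoint's range and fills the x positions with shifts and masks. The name `encode_utf8` is made up for this illustration, and the surrogate check simply mirrors the manpage's last paragraph; treat it as a study aid under those assumptions, not as a replacement for Python's own `str.encode`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one codepoint by hand, following the utf-8(7) chart.

    Hypothetical helper written for illustration only.
    """
    if cp < 0:
        raise ValueError("codepoints cannot be negative")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogates must not appear in UTF-8")

    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0x1FFFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        # (the chart reaches 0x1FFFFF; Unicode itself stops at U+10FFFF)
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("codepoint is outside the ranges in the chart")


print(encode_utf8(0x0416).hex())  # d096
```

Each continuation byte carries six bits, which is why the shifts step down in multiples of six.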

It might be easier to remember a 'compressed' version of the chart:

The initial byte of a mangled (multi-byte) codepoint starts with a 1, padded with further 1s and then a 0: as many 1s in total as there are bytes in the sequence. Subsequent bytes start with 10.

0x80      5 bits in the initial byte, one subsequent byte
0x800     4 bits in the initial byte, two subsequent bytes
0x10000   3 bits in the initial byte, three subsequent bytes

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048       == 0x800
2**(4+2*6) == 65536      == 0x10000
2**(3+3*6) == 2097152    == 0x200000
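
The same arithmetic can be checked in a few lines of Python. This is only a sketch: `range_limit` and `CONTINUATION_BITS` are names invented here, and the (initial-byte bits, subsequent bytes) pairs are read off the compressed chart.

```python
CONTINUATION_BITS = 6  # every 10xxxxxx byte carries six payload bits


def range_limit(lead_bits: int, continuation_bytes: int) -> int:
    """First codepoint that no longer fits in this many payload bits."""
    return 2 ** (lead_bits + continuation_bytes * CONTINUATION_BITS)


# (payload bits in the initial byte, number of subsequent bytes)
for lead_bits, continuation_bytes in [(5, 1), (4, 2), (3, 3)]:
    limit = range_limit(lead_bits, continuation_bytes)
    total_bytes = continuation_bytes + 1
    print(f"{total_bytes} bytes hold codepoints below {limit:#x}")
```

This prints the limits 0x800, 0x10000 and 0x200000, matching the three lines above.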

I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110b. Drop the bits into the x positions above (start from the right; we'll fill in missing bits at the start with 0):

   1110x100 10111000 10111110

There is one x spot left over at the start; fill it in with 0:

   11100100 10111000 10111110

Convert the bits back to hex:

   0xE4 0xB8 0xBE
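
The walk-through can be mirrored almost literally with bit strings, which may be handy for checking hand-worked answers. The sketch below assumes valid, non-surrogate codepoints, and the names `TEMPLATES`, `utf8_by_hand` and `utf16_units` are invented for this example. The answer above stops at UTF-8; the UTF-16 helper relies on the standard rule that a codepoint below 0x10000 is a single 16-bit unit equal to the codepoint itself, while larger codepoints become a surrogate pair.

```python
TEMPLATES = [
    (0x7F, "0xxxxxxx"),
    (0x7FF, "110xxxxx 10xxxxxx"),
    (0xFFFF, "1110xxxx 10xxxxxx 10xxxxxx"),
    (0x1FFFFF, "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"),
]


def utf8_by_hand(cp: int) -> str:
    """Return the UTF-8 bytes of one codepoint as hex, e.g. 'E4 B8 BE'."""
    # Find the first range the codepoint fits in.
    template = next(t for limit, t in TEMPLATES if cp <= limit)
    # Zero-pad the binary form on the left so it fills every x slot.
    bits = iter(format(cp, "b").zfill(template.count("x")))
    filled = "".join(next(bits) if ch == "x" else ch for ch in template)
    return " ".join(f"{int(byte, 2):02X}" for byte in filled.split())


def utf16_units(cp: int) -> str:
    """Return the UTF-16 code units of one codepoint as hex, e.g. '4E3D'."""
    if cp < 0x10000:
        return f"{cp:04X}"
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top ten bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom ten bits
    return f"{high:04X} {low:04X}"


print(utf8_by_hand(0x4E3E))  # E4 B8 BE, as in the worked example

# The codepoints from the question: U+006D, U+0416, U+4E3D
for cp in (0x006D, 0x0416, 0x4E3D):
    print(f"U+{cp:04X}  UTF-8: {utf8_by_hand(cp)}  UTF-16: {utf16_units(cp)}")
```

Under these assumptions the question's string comes out as 6D, D0 96, E4 B8 BD in UTF-8 and as the units 006D 0416 4E3D in UTF-16. How each 16-bit unit is split into bytes depends on the byte order (big-endian or little-endian), which the question does not specify.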
