Convert UTF-8 octets to Unicode code points

Question

I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?

For example, the UTF-8 octets ['0xc5', '0x81'] should be converted to the code point 0x141.

Recommended answer

Python 3.x:

In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
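
To make the distinction concrete, here is a minimal sketch (Python 3 only; the variable names `raw` and `text` are just illustrative) showing that the same data is two octets as bytes but a single character as str:

```python
# The same data seen as octets (bytes) and as text (str).
raw = b'\xc5\x81'            # two octets
text = raw.decode('utf-8')   # one Unicode character

print(type(raw).__name__, len(raw))    # bytes 2
print(type(text).__name__, len(text))  # str 1

# Iterating over a bytes object yields integers, one per octet.
print(list(raw))                       # [197, 129]
```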

If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:

>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'

You can then convert to str (i.e. Unicode) using the str constructor...

>>> str(b'\xc5\x81', 'utf-8')
'Ł'

...or by calling .decode('utf-8') on the bytes object:

>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
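
Putting those steps together, a small helper might look like the sketch below. The function name `octets_to_code_points` is hypothetical, and it assumes the input is a list of strings such as '0xc5' that together form valid UTF-8; invalid input raises an exception rather than being repaired.

```python
def octets_to_code_points(octets):
    """Decode UTF-8 octets given as strings such as '0xc5' into code points."""
    # int(x, 0) infers the base from the '0x' prefix;
    # bytes() raises ValueError for values outside range(256).
    raw = bytes(int(x, 0) for x in octets)
    # decode() raises UnicodeDecodeError if the octets are not valid UTF-8.
    text = raw.decode('utf-8')
    # ord() gives the code point of each decoded character.
    return [ord(ch) for ch in text]


code_points = octets_to_code_points(['0xc5', '0x81', '0xe2', '0x82', '0xac'])
print([hex(cp) for cp in code_points])  # ['0x141', '0x20ac']

# Cross-check by hand: a two-byte sequence 110xxxxx 10yyyyyy carries
# five payload bits in the first octet and six in the second.
assert ((0xC5 & 0x1F) << 6) | (0x81 & 0x3F) == 0x141
```

Returning a list rather than a single integer keeps the helper usable when the octets encode more than one character.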

Pre-3.x:

Prior to 3.x, the str type was a byte string, and unicode was the type for Unicode text.

Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:

>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'

You can then convert to unicode using the constructor...

>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'

...or by calling .decode('utf-8') on the str:

>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
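
If the same code has to run on both lines of Python, one possible approach (not part of the answer above, and only a sketch) is to go through bytearray, which as far as I know exists with a .decode() method in Python 2.6+ as well as 3.x. The helper name `decode_octets` is hypothetical. One caveat to keep in mind: on narrow Python 2 builds, characters outside the Basic Multilingual Plane would come back as surrogate pairs rather than single code points.

```python
def decode_octets(octets):
    """Return the code points encoded by UTF-8 octets such as ['0xc5', '0x81']."""
    # bytearray accepts an iterable of integers in range(256);
    # decoding it gives unicode on Python 2 and str on Python 3.
    text = bytearray(int(x, 0) for x in octets).decode('utf-8')
    return [ord(ch) for ch in text]


print([hex(cp) for cp in decode_octets(['0xc5', '0x81'])])  # ['0x141']
```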
