Convert UTF-8 octets to Unicode code points
Question
I have a sequence of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?
For example, the UTF-8 octets ['0xc5', '0x81'] should be converted to the code point 0x141.
Recommended answer
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings of the form '0xc5' (rather than '\xc5'), you can convert them to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (i.e. Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
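The Python 3 steps above can be combined into one small helper. This is only a sketch: the function name octets_to_code_points is made up for illustration, and it assumes the input is a list of strings such as '0xc5', as in the question.

```python
def octets_to_code_points(octets):
    """Hypothetical helper: decode UTF-8 octets given as strings such as
    '0xc5' and return the Unicode code points as a list of ints."""
    # int(x, 0) infers the base from the '0x' prefix.
    raw = bytes(int(x, 0) for x in octets)
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    text = raw.decode('utf-8')
    # ord() maps each character to its code point.
    return [ord(ch) for ch in text]


print([hex(cp) for cp in octets_to_code_points(['0xc5', '0x81'])])
```

Because the whole byte sequence is decoded at once, multi-byte characters are handled together, so the result has one entry per character rather than one per octet.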
Pre-3.x:
Prior to 3.x, the str type was a byte string, and the unicode type was for Unicode text.
Again, if by "octets" you really mean strings of the form '0xc5' (rather than '\xc5'), you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the unicode constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'