UTF-8 coding in Python

Problem description

I have a UTF-8 character whose bytes are written as hex digits separated by '_', e.g. '_ea_b4_80'. I'm trying to convert it into the actual character using the replace method, but I can't get the correct encoding.

This is a code example:

import sys
reload(sys)  
sys.setdefaultencoding('utf8')

r = '_ea_b4_80'
r2 = '\xea\xb4\x80'

r = r.replace('_', '\\x')
print r
print r.encode("utf-8")
print r2

In this example, r is not the same as r2. This is the output:

\xea\xb4\x80
\xea\xb4\x80
관  <-- correctly shown 

What might be wrong?

Solution

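\x is only meaningful in string literals, so you can't use replace to add it. At runtime, replace just produces the ordinary characters backslash and x followed by the hex digits, not the byte that the escape would denote in source code.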
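
One way to see the difference is to compare lengths. The following is a small Python 3 sketch; the names built and literal are illustrative and do not come from the question:

```python
# Built at runtime: each "\x" is two ordinary characters (a backslash and an 'x').
built = '_ea_b4_80'.replace('_', '\\x')

# Written as a literal: the parser turns each \xNN escape into a single byte.
literal = b'\xea\xb4\x80'

print(len(built))               # 12 characters
print(len(literal))             # 3 bytes
print(literal.decode('utf-8'))  # 관
```

The runtime string is twelve plain characters, so encoding it as UTF-8 only re-encodes those characters and never yields the three bytes of the intended character.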

To get your desired result, convert to bytes, then decode:

import binascii

r = '_ea_b4_80'

rhexonly = r.replace('_', '')          # Returns 'eab480'
rbytes = binascii.unhexlify(rhexonly)  # Returns b'\xea\xb4\x80'
rtext = rbytes.decode('utf-8')         # Returns '관' (unicode if Py2, str Py3)
print(rtext)

which should give you the result you want.
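
If this conversion is needed in several places, the steps above could be wrapped in a function. This is only a sketch: decode_underscore_hex and its sep and encoding parameters are hypothetical names chosen for illustration.

```python
import binascii


def decode_underscore_hex(s, sep='_', encoding='utf-8'):
    """Decode a string such as '_ea_b4_80' into text (hypothetical helper)."""
    hex_only = s.replace(sep, '')        # '_ea_b4_80' -> 'eab480'
    raw = binascii.unhexlify(hex_only)   # 'eab480' -> the three raw bytes
    return raw.decode(encoding)          # raw bytes -> text


print(decode_underscore_hex('_ea_b4_80'))  # 관
```

Malformed input is not handled here: binascii.unhexlify raises binascii.Error for non-hex or odd-length input, and decode raises UnicodeDecodeError if the bytes are not valid in the chosen encoding.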

If you're using modern Py3, you can avoid the import by using the bytes.fromhex class method in place of binascii.unhexlify (assuming r is in fact a str; bytes.fromhex, unlike binascii.unhexlify, only takes str input, not bytes input):

rbytes = bytes.fromhex(rhexonly)  # Returns b'\xea\xb4\x80'
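
As a quick check that the two approaches agree for this input, here is a short Python 3 sketch; via_fromhex and via_unhexlify are illustrative names only:

```python
import binascii

r = '_ea_b4_80'
rhexonly = r.replace('_', '')                 # 'eab480'

via_fromhex = bytes.fromhex(rhexonly)         # no import needed for this one
via_unhexlify = binascii.unhexlify(rhexonly)  # same conversion via binascii

print(via_fromhex == via_unhexlify)           # True
print(via_fromhex.decode('utf-8'))            # 관
```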
