Handle wrongly encoded character in Python unicode string

Problem description

I am dealing with unicode strings returned by the python-lastfm library.

I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the original string I am expecting in the variable a is "Glück":

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.

How do I convert this back to a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError or a string containing the sequence \xfc.

Solution

Your unicode string is fine:

>>> import unicodedata
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'
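
The same check can be extended to every character in the string. The following is only a sketch: the helper name `describe` is made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text.

    `describe` is a hypothetical helper, not part of any library.
    """
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]


a = u'Gl\xfcck'  # the value shown in the question

# Print one line per character; the output is pure ASCII, so it is safe
# even when the terminal encoding is unknown.
for code_point, name in describe(a):
    print('%s %s' % (code_point, name))
```

If the third entry is reported as U+00FC, the string already holds the intended "ü" as a single code point, and nothing needs to be decoded again.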

The problem you see at the interactive prompt is that the interpreter doesn't know which encoding to use when writing the string to your terminal, so it falls back to the "ascii" codec, which can only handle ASCII characters. It works fine on my machine because sys.stdout.encoding is "UTF-8" for me, most likely because my environment variable settings differ from yours:

>>> print u'Gl\xfcck'
Glück
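
The fallback can be reproduced without a terminal by encoding the string explicitly. This is a rough sketch rather than a definitive fix; the variable names are illustrative, and it is written to run on both Python 2 and Python 3.

```python
import sys

a = u'Gl\xfcck'  # the value shown in the question

# Encoding with the ASCII codec fails in the same way as the failing print.
try:
    a.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print('ascii failed: %s' % exc)

# Codecs that cover U+00FC succeed and produce different byte sequences.
utf8_bytes = a.encode('utf-8')      # b'Gl\xc3\xbcck'
latin1_bytes = a.encode('latin-1')  # b'Gl\xfcck'

# The encoding used when printing to this process's standard output.
# On Python 2 this can be None when output is piped or redirected.
print(sys.stdout.encoding)
```

If the terminal really does display UTF-8, setting the PYTHONIOENCODING environment variable (for example to utf-8) before starting the interpreter is one way to override the fallback. Whether that is appropriate depends on the terminal in use.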
