Decoding if it's not unicode
Question
I want my function to take an argument that could be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this:
def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')
    ...
Is it possible to avoid the use of isinstance? I am looking for something more duck-typing friendly.
During my experiments with decoding, I have run into several weird behaviours in Python. For instance:
>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)
or
>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported
By the way, I'm using Python 2.6.
Recommended answer
You could just try decoding it with the 'utf-8' codec, and if that does not work, return the object unchanged.
def myfunction(text):
    try:
        text = unicode(text, 'utf-8')
    except TypeError:
        # text is already a unicode object, so leave it unchanged
        pass
    return text

print(myfunction(u'cer\xf3n'))
# cerón
When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method.
Sometimes the conversion from unicode object to string object fails, because Python 2 uses the ascii codec by default and any non-ASCII character (such as u'\xf3') cannot be encoded with it.
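The implicit step described above can be reproduced by hand. The following is only a sketch of the behaviour, not what the interpreter literally executes; the variable names are made up for illustration. It encodes the text with the ascii codec explicitly, which is the conversion Python 2 attempts before decoding:

```python
# Sketch: perform explicitly the ASCII encode that Python 2 does implicitly
# when .decode() is called on a unicode object.
text = u'cer\xf3n'

try:
    text.encode('ascii')  # u'\xf3' is outside the ASCII range
    implicit_step_failed = False
except UnicodeEncodeError:
    implicit_step_failed = True

# Pure-ASCII text survives the same round trip, which is why
# u'hello'.decode('utf-8') appeared to work in the question.
round_trip = u'hello'.encode('ascii').decode('utf-8')

print(implicit_step_failed)
print(round_trip)
```

The failure is an encode error rather than a decode error, which matches the UnicodeEncodeError shown in the question's traceback.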
So, in general, never try to decode unicode objects. Or, if you must try, wrap the call in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.
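One way to follow the "never decode unicode objects" advice is to decode only values that are byte strings. The helper below is a hypothetical sketch, not part of the original answer: the name to_text and its encoding parameter are assumptions, and it does rely on an isinstance check, which the question hoped to avoid. It assumes Python 2.6 or later, where bytes is an alias for str.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it only if it is a byte string."""
    # On Python 2.6+ `bytes` is an alias for `str`; on Python 3 it is the
    # dedicated byte-string type. Text objects pass through untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(b'cer\xc3\xb3n') == u'cer\xf3n')
print(to_text(u'cer\xf3n') == u'cer\xf3n')
```

Because text objects are returned as-is, the implicit ascii conversion is never triggered.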
See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:
"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
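As a rough illustration of the quote's last point, the conversions it names remain reachable in Python 3 through module-level functions rather than through the str.encode / bytes.decode methods. This is a minimal sketch using the standard library's base64, bz2 and codecs modules; the variable names are arbitrary.

```python
import base64
import bz2
import codecs

data = b'hello'

# base64: bytes-to-bytes, via the base64 module
b64 = base64.b64encode(data)
restored_b64 = base64.b64decode(b64)

# bz2: bytes-to-bytes, via the bz2 module
compressed = bz2.compress(data)
restored_bz2 = bz2.decompress(compressed)

# rot13: text-to-text, via codecs.encode rather than str.encode
rot = codecs.encode(u'hello', 'rot_13')

print(b64)
print(rot)
```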