Decoding if it's not unicode


Question

I want my function to take an argument that could be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this:

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid the use of isinstance? I am looking for something more duck-typing friendly.

While experimenting with decoding, I ran into some weird Python behaviour. For instance:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I'm using Python 2.6.

Recommended answer

You could just try decoding it with the 'utf-8' codec, and if that does not work, return the object unchanged.

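def myfunction(text):
    try:
        # Byte strings are decoded here; unicode objects raise TypeError.
        text = unicode(text, 'utf-8')
    except TypeError:
        # Already a unicode object, so there is nothing to decode.
        pass
    # Return on both paths; without this, decoded input would yield None.
    return text

print(myfunction(u'cer\xf3n'))
# cerón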
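To see the same idea handle both kinds of input, here is a minimal sketch that also runs on Python 3. The names `text_type` and `to_text` are hypothetical, chosen for illustration; they are not part of the original answer or of any library.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3: str is the text type

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it only if it is a byte string."""
    try:
        return text_type(value, encoding)
    except TypeError:
        # Raised for input that is already text
        # ("decoding Unicode is not supported" on Python 2).
        return value

decoded = to_text(b'cer\xc3\xb3n')   # UTF-8 encoded bytes
unchanged = to_text(u'cer\xf3n')     # already text
print(decoded == unchanged)
```

Note that only TypeError is caught. A byte string that is not valid UTF-8 still raises UnicodeDecodeError, which is probably what a caller wants to know about.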

When you take a unicode object and call its decode method with the 'utf-8' codec, Python 2 first tries to convert the unicode object to a byte string (str), and then calls that string's decode('utf-8') method.

That implicit conversion from unicode to str fails whenever the text contains non-ASCII characters, because Python 2 uses the ascii codec by default. This is why u'hello' decodes without complaint while u'cer\xf3n' raises a UnicodeEncodeError.
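The two steps can be spelled out by hand. This is only a sketch of the behaviour described here: `decode_like_python2` is a hypothetical helper name, and the explicit `encode('ascii')` stands in for the conversion that Python 2 performs implicitly.

```python
def decode_like_python2(text, encoding='utf-8'):
    """Mimic what Python 2 does for some_unicode.decode(encoding)."""
    as_bytes = text.encode('ascii')    # step 1: implicit unicode -> str conversion
    return as_bytes.decode(encoding)   # step 2: the decode that was asked for

# Pure ASCII text survives step 1, so the round trip works.
print(decode_like_python2(u'hello'))

# Non-ASCII text fails in step 1, before any decoding happens.
try:
    decode_like_python2(u'cer\xf3n')
except UnicodeEncodeError as exc:
    print('step 1 failed: %s' % exc)
```

This explains the confusing part of the traceback in the question: a call to decode ends in an encode error because the failure happens in the hidden first step.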

So, in general, never try to decode unicode objects. Or, if you must try, wrap the call in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed from the encode/decode API in Python 3.

See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:


"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
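As an illustration of the last point in the quote, here is a minimal Python 3 sketch of reaching those transforms without str.encode or bytes.decode. It assumes a reasonably recent Python 3 (3.4 or later); the variable names are made up for the example.

```python
import base64
import bz2
import codecs

# rot13 maps text to text, so it goes through codecs.encode, not str.encode.
rotated = codecs.encode('hello', 'rot_13')

# base64 and bz2 map bytes to bytes and live in their own modules.
b64 = base64.b64encode(b'hello')
round_trip = bz2.decompress(bz2.compress(b'hello'))

print(rotated)      # uryyb
print(b64)          # b'aGVsbG8='
print(round_trip)   # b'hello'
```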
