urllib.urlencode doesn't like unicode values: how about this workaround?

Question

If I have an object like:

d = {'a':1, 'en': 'hello'}

...then I can pass it to urllib.urlencode, no problem:

from urllib import urlencode  # Python 2 location of urlencode
percent_escaped = urlencode(d)
print percent_escaped

But if I try to pass an object with a value of type unicode, game over:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(d2)
print percent_escaped # This fails with a UnicodeEncodeError

So my question is about a reliable way to prepare an object to be passed to urlencode.

I came up with this function where I simply iterate through the object and encode values of type string or unicode:

def encode_object(object):
  for k,v in object.items():
    if type(v) in (str, unicode):
      object[k] = v.encode('utf-8')
  return object

This seems to work:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(encode_object(d2))
print percent_escaped

And that outputs a=1&en=hello&pt=ol%C3%A1, ready for passing to a POST call or whatever.

But my encode_object function just looks really shaky to me. For one thing, it doesn't handle nested objects.

For another, I'm nervous about that if statement. Are there any other types that I should be taking into account?

And is comparing the type() of something to the native object like this good practice?

type(v) in (str, unicode) # not so sure about this...

Thanks!

Answer

You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
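
As a rough illustration of that principle, here is a minimal sketch. It assumes Python 3 (this thread is about Python 2), where urllib.parse.urlencode accepts text values and UTF-8 encodes them before percent-escaping. The variable names are made up for the example.

```python
from urllib.parse import urlencode

# 1. Decode at input: bytes arriving from outside become text immediately.
raw_value = b'ol\xc3\xa1'               # UTF-8 bytes for u'olá'
text_value = raw_value.decode('utf-8')

# 2. Work exclusively in text.
params = {'a': 1, 'en': 'hello', 'pt': text_value}

# 3. Encode at output: urlencode percent-escapes the UTF-8 form of each
#    text value, so the resulting query string is plain ASCII.
query = urlencode(params)
body = query.encode('ascii')            # bytes, ready for an HTTP request body
print(body)
```

The point is only the ordering of the three steps; nothing in the middle ever holds a mixture of bytes and text.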

Update based on comments:

You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if there are unicode characters with ordinal >= 128 in your dict is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does:

>>> import urllib
>>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
>>> for test in tests:
...     print repr(test), repr(urllib.urlencode({'a':test}))
...
'\x80' 'a=%80'
'\xe2\x82\xac' 'a=%E2%82%AC'
1 'a=1'
'1' 'a=1'
u'1' 'a=1'
u'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\python27\lib\urllib.py", line 1282, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

The last two tests demonstrate the problem with urlencode(). Now let's look at the str tests.

If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.

'\x80' is suspicious -- it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK -- all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' where necessary.
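
The three verdicts above can be checked mechanically. The following is a small sketch, assuming Python 3 (where byte strings are written with a b prefix); is_valid_utf8 is a hypothetical helper name, not something urllib provides.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


for sample in (b'\x80', b'\xe2\x82\xac', b'1'):
    print(repr(sample), is_valid_utf8(sample))
```

A lone 0x80 byte is a continuation byte with no start byte, so it fails; the other two samples pass.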

Here's a suggested converter function. Unlike yours, which mutates the input dict and also returns it, this one leaves the input untouched and returns a new dict. It forces an exception if a value is a str object that is not valid UTF-8. By the way, your concern about not handling nested objects is a little misdirected -- your code works only with dicts, and nested dicts don't really make sense for urlencode(), which produces flat key=value pairs.

def encoded_dict(in_dict):
    out_dict = {}
    for k, v in in_dict.iteritems():
        if isinstance(v, unicode):
            v = v.encode('utf8')
        elif isinstance(v, str):
            # Must be encoded in UTF-8
            v.decode('utf8')
        out_dict[k] = v
    return out_dict

and here's the output, using the same tests in reverse order (because the nasty one is at the front this time):

>>> for test in tests[::-1]:
...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
...
u'\u20ac' 'a=%E2%82%AC'
u'\x80' 'a=%C2%80'
u'1' 'a=1'
'1' 'a=1'
1 'a=1'
'\xe2\x82\xac' 'a=%E2%82%AC'
'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 8, in encoded_dict
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>>
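
For anyone reading this on Python 3, a rough counterpart of the converter might look like the sketch below. This is an assumption on my part rather than something tested in this thread: on Python 3 the text type is str and the byte type is bytes, and urllib.parse.urlencode already copes with non-ASCII text, so the converter mainly serves to reject byte strings that are not valid UTF-8. The encoding parameter is added for illustration.

```python
from urllib.parse import urlencode


def encoded_dict(in_dict, encoding='utf-8'):
    """Return a new dict whose text values are encoded to bytes.

    Byte-string values are kept as-is but must already be valid in the
    given encoding; otherwise UnicodeDecodeError is raised.
    """
    out_dict = {}
    for k, v in in_dict.items():
        if isinstance(v, str):
            v = v.encode(encoding)
        elif isinstance(v, bytes):
            v.decode(encoding)  # validation only; the result is discarded
        out_dict[k] = v
    return out_dict


print(urlencode(encoded_dict({'a': '\u20ac'})))
print(urlencode(encoded_dict({'a': b'\xe2\x82\xac'})))
```

Using isinstance() rather than comparing type() against a tuple means subclasses of str or bytes are handled too.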

Does that help?
