Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?

Problem description

Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:

In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

/home/kkinder/<ipython console> in <module>()

/usr/lib/python2.6/urllib.pyc in quote(s, safe)
   1222             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
   1223         _safemaps[cachekey] = safe_map
-> 1224     res = map(safe_map.__getitem__, s)
   1225     return ''.join(res)
   1226 

KeyError: u'\xc3'

Encoding the value to UTF8 also does not work:

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño

It's recognized as a bug and there is a fix, but not for my version of Python.

What I'd like is something similar to urllib.quote/urllib.unquote that handles unicode values correctly, so that this code would work:

decode_url(encode_url(u'Cataño')) == u'Cataño'

Any recommendations?

Solution

> Python's urllib.quote and urllib.unquote do not handle Unicode correctly

urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.

IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.

> Encoding the value to UTF8 also does not work:
>
> In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
> Cataño

Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.

Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:

>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xC3\xB1o'
>>> urllib.quote(_)
'Cata%C3%B1o'

>>> urllib.unquote(_)
'Cata\xC3\xB1o'
>>> _.decode('utf-8')
u'Cata\xF1o'
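
The manual encode/decode steps above could be wrapped into the encode_url/decode_url pair the question asks for. This is only a minimal sketch, not part of urllib: the two function names come from the question, while the safe parameter default, the UTF-8 choice, and the Python 3 import fallback (quote_from_bytes/unquote_to_bytes) are assumptions added so the same code runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: encode_url/decode_url are the hypothetical helpers named in
# the question, not functions provided by urllib.
try:
    # Python 2: quote/unquote operate on byte strings.
    from urllib import quote, unquote
except ImportError:
    # Python 3 (assumption for portability): use the bytes-oriented variants.
    from urllib.parse import quote_from_bytes as quote
    from urllib.parse import unquote_to_bytes as unquote


def encode_url(text, safe='/'):
    """Encode a unicode string as UTF-8 bytes, then percent-quote them."""
    return quote(text.encode('utf-8'), safe)


def decode_url(quoted):
    """Unquote a percent-encoded string and decode the bytes as UTF-8."""
    if not isinstance(quoted, bytes):
        # A properly quoted URL is pure ASCII, so this conversion is lossless.
        quoted = quoted.encode('ascii')
    return unquote(quoted).decode('utf-8')


# encode_url(u'Cata\u00F1o') gives 'Cata%C3%B1o'
# decode_url('Cata%C3%B1o') gives u'Cata\xf1o'
```

With these, decode_url(encode_url(u'Cata\u00F1o')) == u'Cata\u00F1o' holds. Note that decode_url raises an error if it is handed a string that still contains non-ASCII characters, since such a string was not a correctly quoted URL to begin with.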
