python url unquote followed by unicode decode

Question

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it.
I used urllib.unquote_plus(str), but the result is wrong.

  • expected : çöasd+fjkls%asd
  • result : Ã§Ã¶asd fjkls%asd

The double-coded UTF-8 characters (%C3%A7 and %C3%B6) are decoded wrongly.
My Python version is 2.7 under a Linux distro. What is the best way to get the expected result?

Recommended answer

You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.

Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
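
Both cases end in the same visible garbage, because the same four bytes get read one byte per character instead of as UTF-8. A minimal sketch of that effect, written so that it should run on both Python 2.7 and Python 3 (the variable names are mine, not from the question):

```python
# The percent-escapes %C3%A7%C3%B6 stand for these four bytes.
raw = b'\xc3\xa7\xc3\xb6'

# Read as UTF-8, the four bytes form two characters (c-cedilla, o-diaeresis).
as_utf8 = raw.decode('utf-8')

# Read one byte per character, as latin1 does, they form four characters,
# starting with A-tilde. This is the "symptomatic rubbish".
as_latin1 = raw.decode('latin-1')

print(len(as_utf8))
print(len(as_latin1))
```

The first length is 2 and the second is 4; repr() of each value shows exactly which code points you ended up with.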

If as you say you start off with a unicode object:

>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'

This is accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:

>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'

Then you should unquote it:

>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
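
The expected output in the question keeps the plus sign, which is why plain unquote is used here rather than unquote_plus. A small sketch of the difference on an ASCII-only sample, assuming nothing beyond the standard library (the try/except import is only there so the same lines run on Python 2.7 and Python 3):

```python
try:
    from urllib.parse import unquote, unquote_plus  # Python 3
except ImportError:
    from urllib import unquote, unquote_plus  # Python 2.7

quoted = 'asd+fjkls%25asd'

# unquote only resolves %XX escapes; '+' is left alone.
kept_plus = unquote(quoted)

# unquote_plus additionally turns '+' into a space (form-encoding rules).
plus_to_space = unquote_plus(quoted)

print(repr(kept_plus))
print(repr(plus_to_space))
```

In the session above, s2 comes from plain unquote, so its plus sign is preserved.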

Looking at the first 4 bytes of s2, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:

>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'

and inspect it to see what we've actually got:

>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
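
The three steps so far (encode to ASCII, unquote to bytes, decode as UTF-8) can be wrapped in one helper. This is only a sketch: unquote_utf8 is a name I made up, and the import shim assumes you want the same lines to run on Python 2.7 and Python 3.

```python
try:
    from urllib.parse import unquote_to_bytes  # Python 3: returns bytes
except ImportError:
    # Python 2.7: unquote on a str (byte string) returns a str (byte string)
    from urllib import unquote as unquote_to_bytes


def unquote_utf8(text):
    """Percent-decode an ASCII-only quoted string, then decode it as UTF-8."""
    raw = text.encode('ascii')         # step 1: text -> bytes (s1)
    unquoted = unquote_to_bytes(raw)   # step 2: resolve %XX escapes (s2)
    return unquoted.decode('utf-8')    # step 3: bytes -> text (s3)


decoded = unquote_utf8(u'%C3%A7%C3%B6asd+fjkls%25asd')
print(repr(decoded))
```

The result compares equal to u'\xe7\xf6asd+fjkls%asd', the same value as s3 above.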

s3 looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd

Note: print implicitly encoded the unicode object using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and in cp1252 and latin1).
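
If you are unsure whether a given console encoding can show your text, you can test it before printing. A rough sketch, where representable is a hypothetical helper of my own and the list of encodings is just the ones mentioned above plus ascii for contrast:

```python
def representable(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


s3 = u'\xe7\xf6asd+fjkls%asd'

# Try each candidate console encoding in turn.
results = dict(
    (enc, representable(s3, enc))
    for enc in ('cp850', 'cp1252', 'latin-1', 'ascii')
)
print(results)
```

Only ascii fails here, since it has no c-cedilla or o-diaeresis; on such a console, printing s3 would raise UnicodeEncodeError instead of showing the text.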
