Double-decoding unicode in Python


Problem description

I am working against an application that seems keen on returning what I believe to be double UTF-8-encoded strings.

I send the string u'XüYß' (that is, u'X\u00fcY\u00df') encoded using UTF-8, so the bytes on the wire are X\xc3\xbcY\xc3\x9f.

The server should simply echo what I sent it, yet it returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (it should be X\xc3\xbcY\xc3\x9f). If I decode that using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a unicode string containing the original string encoded as UTF-8.

But Python won't let me decode a unicode string without re-encoding it first, and that re-encoding fails for a reason that escapes me:

>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...

How do I persuade Python to re-decode the string? And is there any practical way of debugging what's actually in the strings, without passing them through all the implicit conversions that print uses?
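For what it's worth, the mangling is easy to reproduce and inspect in Python 3; repr() on a bytes object shows exactly what is in it, with no implicit conversions. This is a sketch assuming the server decodes your UTF-8 bytes as Latin-1 and re-encodes the result as UTF-8 (which the bytes above suggest):

```python
# Sketch (Python 3): reproduce the suspected server bug, namely
# decoding UTF-8 bytes as Latin-1 and re-encoding them as UTF-8.
original = 'XüYß'
sent = original.encode('utf-8')
mangled = sent.decode('latin1').encode('utf-8')

# repr() shows the raw bytes without any implicit conversion,
# which makes it a practical debugging tool here.
print(repr(sent))     # b'X\xc3\xbcY\xc3\x9f'
print(repr(mangled))  # b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'
```

The mangled bytes match what the server returns, which supports the double-encoding theory.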

(And yes, I have reported this behaviour to the developers of the server side.)

Recommended answer

ret.decode() implicitly tries to encode ret with the system encoding first (ascii, in your case); that implicit encode is the step that fails.
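In Python 3 terms, that implicit step can be written out explicitly. This is a minimal sketch of the failure only, not of the fix:

```python
# Python 3 sketch of the implicit step Python 2 performs: before
# .decode() can run on a unicode string, the string is first
# encoded with the default codec (ascii), and that is what raises.
ret = 'X\xc3\xbcY\xc3\x9f'
try:
    ret.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character ...
```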

If you explicitly encode the unicode string, you should be fine. There is a built-in codec that does what you need:

>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'

Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'

>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
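Putting the two observations together, here is a small hypothetical helper (the name undouble and the fallback behaviour are illustrative, not part of the original answer) that reverses one round of double encoding and leaves already-correct strings alone:

```python
def undouble(s: str) -> str:
    """Undo one round of accidental Latin-1 -> UTF-8 double encoding.

    Hypothetical helper: if the round trip fails, the string was
    probably not double-encoded, so it is returned unchanged.
    """
    try:
        return s.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(undouble('X\xc3\xbcY\xc3\x9f'))  # XüYß
print(undouble('XüYß'))                # XüYß (already correct)
```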

In case you run into this sort of mixed data, you can use the codec again to normalize everything:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'

>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
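The same round trip can be wrapped in a tiny helper (the name normalize is mine, purely illustrative):

```python
def normalize(s: str) -> str:
    # Collapse literal backslash-u escape sequences and real
    # characters into real characters, per the round trip above.
    return s.encode('raw_unicode_escape').decode('raw_unicode_escape')

print(normalize('\\u20ac€'))  # €€
```

Note that this also interprets backslash-u sequences that were meant literally, so it is only safe on data like the above.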
