Bytes in a unicode Python string


Question

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 then gives back the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

Recommended answer


"In Python 2, Unicode strings may contain both unicode and bytes:"

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
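To make this concrete, here is a small sketch that inspects that one character. It is written so that it should run on both Python 2.7 and Python 3.3+; the variable names are only illustrative.

```python
# u'\xd0' is a single Unicode character (code point 208), not a byte.
ch = u'\xd0'

same_char = (ch == u'\u00d0')    # True: \xd0 and \u00d0 are two spellings of one character
code_point = ord(ch)             # 208
utf8_bytes = ch.encode('utf-8')  # two bytes, because code point 208 is outside ASCII

print(repr(utf8_bytes))
```

The two bytes produced here, \xc3\x90, are the same pair that shows up in the question's a.encode('utf-8') output, which is why encoding and then decoding simply round-trips to the original string.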

There is no way to look at the string and tell whether \xd0 is supposed to be part of some UTF-8 encoded character or actually stands for that Unicode character by itself.
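A short illustration of that ambiguity, using a made-up two-character string rather than the one from the question: both readings below are valid, and nothing in the string says which one was intended.

```python
# Hypothetical example string: two characters with code points 0xC3 and 0xA9.
s = u'\xc3\xa9'

# Reading 1: take the string at face value. It is the two characters
# U+00C3 (A with tilde) and U+00A9 (copyright sign).
as_is = s

# Reading 2: assume each character is really a byte of UTF-8 data.
# Latin-1 maps code points 0-255 straight to byte values 0-255.
reinterpreted = s.encode('latin-1').decode('utf-8')

print(len(as_is), len(reinterpreted))
```

Under the second reading the string collapses to the single character U+00E9 (e with acute accent), so the choice between the two has to come from knowledge about where the data came from.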

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), treats characters < 256 as bytes to be decoded as UTF-8, and passes characters >= 256 through unchanged.
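One possible sketch of that approach follows. It is not the only way to do it: instead of looping with ord, it uses a regular expression to find runs of characters with code points below 256, turns each run into bytes via Latin-1, and decodes those bytes as UTF-8. The function name redecode_low_runs is hypothetical, and the fallback of leaving a run untouched when it is not valid UTF-8 is an assumption about what is wanted, not something the answer specifies. It should run on Python 2.7 and Python 3.3+.

```python
import re

# Matches runs of characters whose code points are all < 256.
_LOW_RUN = re.compile(u'[\\x00-\\xff]+')


def redecode_low_runs(text):
    """Re-decode runs of code points < 256 as UTF-8; keep everything else.

    Assumption: every such run is meant to be UTF-8 bytes. A run that is
    not valid UTF-8 is returned unchanged rather than raising.
    """
    def _redecode(match):
        run = match.group(0)
        try:
            # Latin-1 maps code points 0-255 one-to-one onto byte values.
            return run.encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            return run

    return _LOW_RUN.sub(_redecode, text)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = redecode_low_runs(a)
print(repr(fixed))
```

For the string from the question this yields u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'. Keep in mind the ambiguity described above: a string that legitimately contains characters such as U+00C3 followed by U+00A9 would also be rewritten.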
