Bytes in a unicode Python string


Problem description

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 then gives back the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

Solution

"In Python 2, Unicode strings may contain both unicode and bytes"

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
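As a quick illustration of that point, the following sketch checks that the \x escape in a Unicode literal names a single character rather than a byte. The variable names are only for illustration, and the snippet is written so that it should run unchanged on Python 2.6+ and Python 3.

```python
# In a Unicode literal, \xd0 and \u00d0 are two spellings of the same character.
ch = u'\xd0'
same = (ch == u'\u00d0')

# It is one character whose code point is 208 (0xD0), not a raw byte.
code_point = ord(ch)
length = len(ch)

# Encoding it to UTF-8 therefore yields the two bytes C3 90,
# which is why the question's encode() output contains \xc3\x90.
encoded = ch.encode('utf-8')
```

Here same is True, code_point is 208, length is 1, and encoded is the two-byte sequence b'\xc3\x90'.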

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded bytes, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes runs of characters < 256 as UTF-8, and passes characters >= 256 through unchanged.
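One possible sketch of that idea follows. It is not the answerer's code: the function name fix_mixed is hypothetical, and instead of calling ord on each character it uses a regular expression to find runs of characters with code points below 256, then relies on the Latin-1 codec, which maps code points 0-255 to the identical byte values. It assumes every such run really is UTF-8 data; a genuine character such as é (U+00E9) in the input would raise UnicodeDecodeError or be silently misread.

```python
import re

# Runs of characters whose code points fit in one byte (0-255).
# ASCII is included, which is harmless because ASCII decodes to itself in UTF-8.
_BYTE_RUN = re.compile(r'[\x00-\xff]+')


def fix_mixed(text):
    """Reinterpret runs of code points < 256 as UTF-8 bytes.

    Characters with code points >= 256 are left untouched.
    Raises UnicodeDecodeError if a run is not valid UTF-8.
    """
    def _decode(match):
        # Latin-1 turns each code point 0-255 into the byte of the same value.
        raw = match.group(0).encode('latin-1')
        return raw.decode('utf-8')

    return _BYTE_RUN.sub(_decode, text)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = fix_mixed(a)
```

With the string from the question, fixed compares equal to u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a', with no eval, repr, or un-escaping step involved.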
