Bytes in a unicode Python string

Views: 185 · Published: 2016/11/19 12:50:46 · Tags: python, unicode, utf-8, character-encoding


Question

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields:
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 then gives back the initial string with the bytes still in it, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

Answer
 
 
  In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
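As a small illustration of the point above, the following sketch checks that u'\xd0' is a single character (U+00D0) and shows what happens when it is encoded. The variable names are only for illustration; the code is written so that it should also run on Python 3, where the `u''` prefix is accepted.

```python
ch = u'\xd0'                          # one Unicode character, not one byte
same_char = (ch == u'\u00d0')         # both escapes name the same code point
code_point = ord(ch)                  # 208, i.e. U+00D0 (the letter Ð)

# Encoding a character produces bytes. U+00D0 needs two bytes in UTF-8,
# which is why the question's encoded output contains \xc3\x90.
encoded = ch.encode('utf-8')
round_trip = encoded.decode('utf-8')  # decoding returns the same character

print(same_char, code_point, repr(encoded), round_trip == ch)
```

This is why the encode/decode round trip in the question returns the original string unchanged: from Python's point of view, nothing in it was ever a byte.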

There is no way to look at the string and tell whether the \xd0 character is supposed to be part of some UTF-8 encoded character, or whether it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
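One possible sketch of that idea follows. It is not the only way to do it, and it rests on the assumption stated above: every run of characters in the range U+0080 to U+00FF is really a sequence of UTF-8 bytes. The names `decode_embedded_utf8` and `_decode_run` are made up for this example. Rather than calling ord on each character, it uses a regular expression to find the runs, turns each run back into bytes with the latin-1 codec (which maps code points 0 to 255 to the same byte values), and decodes those bytes as UTF-8. ASCII characters are left alone because UTF-8 decoding would not change them.

```python
import re

# Runs of characters whose code points fit in a single non-ASCII byte.
_BYTE_LIKE_RUN = re.compile(u'[\x80-\xff]+')


def _decode_run(match):
    """Reinterpret one run of byte-like characters as UTF-8 (hypothetical helper)."""
    run = match.group(0)
    try:
        # latin-1 maps U+0000..U+00FF one-to-one onto byte values 0..255.
        return run.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 after all: assume these were real characters.
        return run


def decode_embedded_utf8(text):
    """Return text with embedded UTF-8 byte sequences decoded.

    Characters >= 256 and ASCII characters pass through unchanged.
    """
    return _BYTE_LIKE_RUN.sub(_decode_run, text)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = decode_embedded_utf8(a)
print(fixed == u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a')
```

The ambiguity described above still applies: a string that genuinely contains the two characters Ðµ would be converted to е as well. The fallback in `_decode_run` only protects runs that are not valid UTF-8, such as the é in u'caf\xe9'.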
