Python Latin Characters and Unicode
Question
I have a tree structure in which keywords may contain some Latin characters. I have a function that loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the Latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard Latin characters, not the decoded version of them.
Recommended answer
You don't have unicode objects, but byte strings containing UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to a string, the list contents are shown as representations, that is, the result of the repr() function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n, for example. Your UTF-8 bytes are represented by \xhh escape sequences.
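One way to print the list without those escapes is to skip the list's own string conversion and join the decoded elements yourself. The following is only a minimal sketch: the helper name format_keywords and the second keyword are hypothetical, and it assumes every keyword really is UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Byte strings holding UTF-8 encoded text, as in the question's leaf_list.
# The second keyword is made up for illustration.
leaf_list = [b'universit\xc3\xa9', b'caf\xc3\xa9']


def format_keywords(byte_keywords, encoding='utf-8'):
    """Decode each byte string and join the results into one Unicode string.

    Hypothetical helper: it avoids str(list), which would show repr() of
    every element, escapes included.
    """
    return u', '.join(kw.decode(encoding) for kw in byte_keywords)


readable = format_keywords(leaf_list)
print(readable)
```

Decoding once, at the point where the keywords enter the program, and working with Unicode objects from then on is usually simpler than decoding at every print.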
If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range (outside ASCII); the rest are shown with \uhhhh and \Uhhhhhhhh escapes, depending on their codepoint. When printing, Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
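The three escape forms mentioned above can be seen side by side with the unicode_escape codec. This is a rough sketch checked against Python 3 only, where repr() of a string no longer escapes printable non-ASCII characters; the helper name escaped is hypothetical, and the codec is used here as a stand-in for the escapes that the Python 2 representation shows.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def escaped(text):
    """Return text with non-ASCII code points written as backslash escapes."""
    return text.encode('unicode_escape').decode('ascii')


latin1 = escaped(u'\xe9')        # U+00E9, inside the Latin-1 range
bmp = escaped(u'\u20ac')         # U+20AC, beyond Latin-1
astral = escaped(u'\U0001f600')  # U+1F600, beyond the Basic Multilingual Plane

print(latin1, bmp, astral)
```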
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length of the byte string reflects that the é codepoint is encoded to two bytes. It was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, by the way, as it is configured to use UTF-8; Python detected this and decoded the bytes when I defined a u'..' Unicode object literal.
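The difference in length can be checked without relying on the terminal encoding at all by writing the é as an escape. This is a small sketch; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-

text = u'universit\xe9'         # Unicode text: one code point per character
encoded = text.encode('utf-8')  # byte string: the é becomes two bytes

assert len(text) == 10
assert len(encoded) == 11
assert encoded[-2:] == b'\xc3\xa9'       # the two UTF-8 bytes for é
assert encoded.decode('utf-8') == text   # decoding restores the original text
```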
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky