Python UTF-8 Latin-1 displays wrong character
Question
I'm writing a very small script that converts Latin-1 characters to Unicode (I'm a complete beginner in Python).
I tried something like this:
def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode("utf-8")
    return uni
It works fine for characters that are not specific to the Latin-1 set, but if I try the following example:
print latin1_to_unicode('å')
It returns Ã¥ instead of å. The same goes for other letters like æ and ø.
Can anyone please explain why this is happening? Thanks.
I have the # -*- coding: utf8 -*- declaration in my script, in case it matters to the problem.
Recommended answer
Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating Mojibake.
Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout, which will have been configured with your terminal or console codec (detected when Python starts).
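As a rough sketch of that advice, the function from the question could shrink to a single decode step. The name utf8_bytes_to_unicode is hypothetical, and the input is written as an escaped byte literal so the example does not depend on how the file itself is saved; it assumes the incoming data really is UTF-8.

```python
def utf8_bytes_to_unicode(data):
    """Decode UTF-8 encoded bytes into a Unicode string.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    return data.decode('utf-8')


# The two bytes a UTF-8 terminal or editor produces for the character å.
raw = b'\xc3\xa5'
text = utf8_bytes_to_unicode(raw)

# Two bytes went in, one Unicode character came out.
print(len(raw))
print(len(text))
```

Passing the decoded value straight to print, without encoding it again, leaves the choice of output codec to sys.stdout.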
My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when you save your Python source with an editor configured to use UTF-8, Python reads those exact same bytes from disk and puts them into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints, one per byte, because Latin-1 maps every byte directly to the codepoint with the same value: U+00C3 (Ã) and U+00A5 (¥).
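To make the difference visible without relying on how a console renders the characters, the following sketch looks up the Unicode names of the codepoints each decode produces. It is written with escaped byte literals, so it does not depend on the source file's encoding; the variable names are illustrative only.

```python
import unicodedata

raw = b'\xc3\xa5'  # UTF-8 encoding of the character å

wrong = raw.decode('latin-1')  # one codepoint per byte
right = raw.decode('utf-8')    # both bytes combine into a single codepoint

wrong_names = [unicodedata.name(ch) for ch in wrong]
right_names = [unicodedata.name(ch) for ch in right]

print(wrong_names)  # ['LATIN CAPITAL LETTER A WITH TILDE', 'YEN SIGN']
print(right_names)  # ['LATIN SMALL LETTER A WITH RING ABOVE']

# Latin-1 decoding is lossless, so encoding back to Latin-1 recovers the
# original bytes, which can then be decoded with the correct codec.
repaired = wrong.encode('latin-1').decode('utf-8')
```

The last line only works because Latin-1 maps each byte to exactly one codepoint; it is meant as a demonstration of what went wrong rather than as a general repair strategy.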
You probably want to read up on the difference between Unicode and encodings, and how that relates to Python:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky