Why doesn't this conversion to utf8 work?

Question

I have a subprocess command that outputs some characters such as '\xf1'. I'm trying to decode it as utf8 but I get an error.

s = '\xf1'
s.decode('utf-8')

The above throws:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 0: unexpected end of data

It works when I use 'latin-1' but shouldn't utf8 work as well? My understanding is that latin1 is a subset of utf8.

Am I missing something here?

print s # ñ
repr(s) # returns "'\\xf1'"

Recommended answer

You have confused Unicode with UTF-8. Latin-1 is a subset of Unicode, but it is not a subset of UTF-8. Avoid like the plague ever thinking about individual code units. Just use code points. Do not think about UTF-8. Think about Unicode instead. This is where you are being confused.
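
A minimal sketch of that distinction, assuming Python 3 (where bytes and str are separate types). The byte 0xF1 is ñ in Latin-1, and ñ is code point U+00F1 in Unicode, but UTF-8 stores that code point as two bytes, so the lone byte is not valid UTF-8. The variable names are illustrative only.

```python
raw = b'\xf1'                      # the single byte from the question

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number.
text = raw.decode('latin-1')
print(hex(ord(text)))              # 0xf1

# UTF-8 encodes U+00F1 as two bytes, not as the single byte 0xF1.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)                  # b'\xc3\xb1'

# In UTF-8, 0xF1 announces a four-byte sequence, so decoding it alone fails.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                  # False
```

In other words, the code point numbers of Latin-1 carry over to Unicode unchanged, while the byte values above 0x7F do not carry over to UTF-8.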

Using Unicode in Python is very easy. It’s especially easy with Python 3 and wide builds, which is the only way I use Python, but you can still use the legacy Python 2 under a narrow build if you are careful about sticking to UTF-8.

To do this, always set your source code encoding and your output encoding correctly to UTF-8. Now stop thinking of UTF-anything and use only UTF-8 literals, logical code point numbers, or symbolic character names throughout your Python program.

Here’s the source code with line numbers:

% cat -n /tmp/py
     1  #!/usr/bin/env python3.2
     2  # -*- coding: UTF-8 -*-
     3  
     4  from __future__ import unicode_literals
     5  from __future__ import print_function
     6  
     7  import sys
     8  import os
     9  import re
    10  
    11  if not (("PYTHONIOENCODING" in os.environ)
    12              and
    13          re.search("^utf-?8$", os.environ["PYTHONIOENCODING"], re.I)):
    14      sys.stderr.write(sys.argv[0] + ": Please set your PYTHONIOENCODING envariable to utf8\n")
    15      sys.exit(1)
    16  
    17  print('1a: el ni\xF1o')
    18  print('2a: el nin\u0303o')
    19  
    20  print('1b: el niño')
    21  print('2b: el niño')
    22  
    23  print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
    24  print('2c: el nin\N{COMBINING TILDE}o')

And here are the print functions with their non-ASCII characters uniquoted using the \x{⋯} notation:

% grep -n ^print /tmp/py | uniquote -x
17:print('1a: el ni\xF1o')
18:print('2a: el nin\u0303o')
20:print('1b: el ni\x{F1}o')
21:print('2b: el nin\x{303}o')
23:print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
24:print('2c: el nin\N{COMBINING TILDE}o')

Sample Runs of Demo Program

Here’s a sample run of that program that shows the three different ways (a, b, and c) of doing it: the first set as literals in your source code (which will be subject to StackOverflow’s NFC conversions and so cannot be trusted!!!) and the second two sets with numeric Unicode code points and with symbolic Unicode character names respectively, again uniquoted so you can see what things really are:

% python /tmp/py
1a: el niño
2a: el niño
1b: el niño
2b: el niño
1c: el niño
2c: el niño

% python /tmp/py | uniquote -x
1a: el ni\x{F1}o
2a: el nin\x{303}o
1b: el ni\x{F1}o
2b: el nin\x{303}o
1c: el ni\x{F1}o
2c: el nin\x{303}o
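
The same code point view can be sketched in plain Python 3 without `uniquote`. This is only an illustration: `code_points` is a hypothetical helper, and the use of `unicodedata.normalize` to relate the two spellings is an addition here, meant to show what an NFC conversion does to the decomposed form.

```python
import unicodedata

composed = 'el ni\u00F1o'      # precomposed ñ: one code point
decomposed = 'el nin\u0303o'   # n followed by COMBINING TILDE: two code points

def code_points(s):
    """List the code points of s in U+XXXX form (hypothetical helper)."""
    return ['U+%04X' % ord(ch) for ch in s]

print(code_points(composed)[5:6])     # ['U+00F1']
print(code_points(decomposed)[5:7])   # ['U+006E', 'U+0303']

# The strings render alike but are different sequences of code points.
print(composed == decomposed)         # False

# NFC composes n + tilde into ñ; NFD splits ñ back apart.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(nfc == composed, nfd == decomposed)   # True True
```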

% python /tmp/py | uniquote -v
1a: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2a: el nin\N{COMBINING TILDE}o
1b: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2b: el nin\N{COMBINING TILDE}o
1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2c: el nin\N{COMBINING TILDE}o
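
A rough stand-in for the named view, assuming Python 3 and the standard `unicodedata` module. The function `name_non_ascii` is a hypothetical helper written for this illustration, not part of `uniquote`.

```python
import unicodedata

def name_non_ascii(s):
    """Spell out each non-ASCII character by its Unicode name."""
    return ''.join(
        ch if ord(ch) < 128 else '\\N{' + unicodedata.name(ch) + '}'
        for ch in s
    )

composed = 'el ni\u00F1o'
decomposed = 'el nin\u0303o'

print(name_non_ascii(composed))     # el ni\N{LATIN SMALL LETTER N WITH TILDE}o
print(name_non_ascii(decomposed))   # el nin\N{COMBINING TILDE}o
```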

I really dislike looking at binary, but here is what that looks like as binary bytes:

% python /tmp/py | uniquote -b
1a: el ni\xC3\xB1o
2a: el nin\xCC\x83o
1b: el ni\xC3\xB1o
2b: el nin\xCC\x83o
1c: el ni\xC3\xB1o
2c: el nin\xCC\x83o
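
Those byte sequences can be checked directly in Python 3 by encoding the two spellings to UTF-8. This is a small sketch with illustrative variable names.

```python
composed = 'el ni\u00F1o'
decomposed = 'el nin\u0303o'

composed_bytes = composed.encode('utf-8')
decomposed_bytes = decomposed.encode('utf-8')

print(composed_bytes)     # b'el ni\xc3\xb1o'
print(decomposed_bytes)   # b'el nin\xcc\x83o'

# Decoding with the same codec restores the original code points.
round_trip_ok = (composed_bytes.decode('utf-8') == composed
                 and decomposed_bytes.decode('utf-8') == decomposed)
print(round_trip_ok)      # True
```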

The Moral of the Story

Even when you use UTF-8 source, you should think about and use only logical Unicode code point numbers (or symbolic named characters), not the individual 8-bit code units that underlie the serial representation of UTF-8 (or for that matter of UTF-16). It is extremely rare to need code units instead of code points, and it just confuses you.
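
To make the difference between code points and code units concrete, here is a small sketch that assumes Python 3.3 or later, where `len()` on a str counts code points. The sample string is made up for illustration.

```python
s = 'ni\u00F1o \U0001F600'   # ñ is inside the BMP; U+1F600 is outside it

code_points = len(s)                                # logical characters
utf8_units = len(s.encode('utf-8'))                 # 8-bit code units
utf16_units = len(s.encode('utf-16-le')) // 2       # 16-bit code units
utf32_units = len(s.encode('utf-32-le')) // 4       # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)   # 6 10 7 6
```

Only the UTF-32 count matches the number of code points, which is why indexing and slicing by code unit in UTF-8 or UTF-16 can cut a character in half.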

You will also get more reliable behavior if you use a wide build of Python 3 than you will get with alternatives to those choices, but that is a UTF-32 matter, not a UTF-8 one. Both UTF-32 and UTF-8 are easy to work with, if you just go with the flow.
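
Whether an interpreter behaves like a wide build can be checked with `sys.maxunicode`. The sketch below assumes Python 3.3 or later, where PEP 393 removed the narrow/wide distinction and every build handles the full code point range; on older narrow builds the same check would report 0xFFFF.

```python
import sys

# 0x10FFFF means every Unicode code point fits in one str element.
is_wide = sys.maxunicode == 0x10FFFF
print(is_wide)          # True on Python 3.3 and later

astral = '\U0001F600'   # a code point outside the Basic Multilingual Plane
print(len(astral))      # 1 when is_wide is True
```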
