Why doesn't this conversion to utf8 work?


Problem Description


I have a subprocess command that outputs some characters such as '\xf1'. I'm trying to decode it as utf8 but I get an error.

s = '\xf1'
s.decode('utf-8')

The above throws:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 0: unexpected end of data

It works when I use 'latin-1' but shouldn't utf8 work as well? My understanding is that latin1 is a subset of utf8.

Am I missing something here?

EDIT:

print s # ñ
repr(s) # returns "'\\xa9'"

Solution

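You have confused Unicode with UTF-8. Latin-1 is a subset of Unicode, but it is not a subset of UTF-8: the first 256 Unicode code points match Latin-1, yet UTF-8 encodes code points U+0080 through U+00FF as two bytes each, so the lone byte 0xF1 is not the UTF-8 encoding of ñ. Avoid thinking about individual code units like the plague. Just use code points. Do not think about UTF-8. Think about Unicode instead. This is where you are getting confused.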
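
To make the distinction concrete, here is a minimal Python 3 sketch (the question uses Python 2, where a plain string is already a byte string). It assumes the subprocess really emits the single byte 0xF1; the variable names are illustrative only.

```python
# The single byte the subprocess is assumed to emit.
raw = b'\xf1'

# Latin-1 maps every byte value 0x00-0xFF to the code point with the
# same number, so decoding it as Latin-1 always succeeds.
text = raw.decode('latin-1')
print(text == '\u00f1')  # True

# UTF-8 encodes that same code point, U+00F1, as two bytes.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # b'\xc3\xb1'

# In UTF-8 a lone 0xF1 is the lead byte of a four-byte sequence, so the
# decoder complains that the data ended too early.
try:
    raw.decode('utf-8')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc.reason)
```

So if the subprocess output is Latin-1, decoding it as Latin-1 is the correct fix; from then on the program only deals with the code point U+00F1.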

Source Code for Demo Program

Using Unicode in Python is very easy. It’s especially easy with Python 3 and wide builds, which is the only way I use Python, but you can still use the legacy Python 2 under a narrow build if you are careful about sticking to UTF-8.

To do this, always set your source code encoding and your output encoding correctly to UTF-8. Then stop thinking about UTF-anything and use only UTF-8 literals, logical code point numbers, or symbolic character names throughout your Python program.
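
As a small illustration of those three spellings, the following Python 3 sketch checks that they denote the same one-character string. The variable names are made up for this example, and the literal only matches if it is stored in precomposed (NFC) form in the source file.

```python
import unicodedata

literal = 'ñ'                                     # UTF-8 literal in the source file
by_number = '\xF1'                                # logical code point number
by_name = '\N{LATIN SMALL LETTER N WITH TILDE}'   # symbolic character name

# The numeric and the named spelling are always the same string.
print(by_number == by_name)  # True

# The literal matches too once it is normalized to NFC, which guards
# against an editor or website having stored it in decomposed form.
print(unicodedata.normalize('NFC', literal) == by_number)  # True

# The character's official Unicode name.
print(unicodedata.name(by_number))  # LATIN SMALL LETTER N WITH TILDE
```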

Here’s the source code with line numbers:

% cat -n /tmp/py
     1  #!/usr/bin/env python3.2
     2  # -*- coding: UTF-8 -*-
     3  
     4  from __future__ import unicode_literals
     5  from __future__ import print_function
     6  
     7  import sys
     8  import os
     9  import re
    10  
    11  if not (("PYTHONIOENCODING" in os.environ)
    12              and
    13          re.search("^utf-?8$", os.environ["PYTHONIOENCODING"], re.I)):
    14      sys.stderr.write(sys.argv[0] + ": Please set your PYTHONIOENCODING envariable to utf8\n")
    15      sys.exit(1)
    16  
    17  print('1a: el ni\xF1o')
    18  print('2a: el nin\u0303o')
    19  
    20  print('1b: el niño')
    21  print('2b: el niño')
    22  
    23  print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
    24  print('2c: el nin\N{COMBINING TILDE}o')

And here are the print calls with their non-ASCII characters uniquoted using the \x{⋯} notation:

% grep -n ^print /tmp/py | uniquote -x
17:print('1a: el ni\xF1o')
18:print('2a: el nin\u0303o')
20:print('1b: el ni\x{F1}o')
21:print('2b: el nin\x{303}o')
23:print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
24:print('2c: el nin\N{COMBINING TILDE}o')

Sample Runs of Demo Program

Here’s a sample run of that program that shows the three different ways (a, b, and c) of doing it: set a uses numeric Unicode code points, set b uses literals in your source code (which are subject to StackOverflow’s NFC conversions and so cannot be trusted!), and set c uses symbolic Unicode character names. The later runs are again uniquoted so you can see what things really are:

% python /tmp/py
1a: el niño
2a: el niño
1b: el niño
2b: el niño
1c: el niño
2c: el niño

% python /tmp/py | uniquote -x
1a: el ni\x{F1}o
2a: el nin\x{303}o
1b: el ni\x{F1}o
2b: el nin\x{303}o
1c: el ni\x{F1}o
2c: el nin\x{303}o

% python /tmp/py | uniquote -v
1a: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2a: el nin\N{COMBINING TILDE}o
1b: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2b: el nin\N{COMBINING TILDE}o
1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2c: el nin\N{COMBINING TILDE}o
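
The 1x and 2x lines render identically but hold different code points, which is why literals that pass through an NFC conversion cannot be trusted. A minimal Python 3 sketch of that relationship, using the standard unicodedata module (the variable names are illustrative):

```python
import unicodedata

precomposed = 'ni\N{LATIN SMALL LETTER N WITH TILDE}o'  # like the 1x lines
decomposed = 'nin\N{COMBINING TILDE}o'                  # like the 2x lines

# Same appearance on screen, different code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 4 5

# NFC composes n + COMBINING TILDE into the single code point U+00F1;
# NFD does the reverse.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```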

I really dislike looking at binary, but here is what that looks like as binary bytes:

% python /tmp/py | uniquote -b
1a: el ni\xC3\xB1o
2a: el nin\xCC\x83o
1b: el ni\xC3\xB1o
2b: el nin\xCC\x83o
1c: el ni\xC3\xB1o
2c: el nin\xCC\x83o
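
If the uniquote tool is not at hand, a rough stand-in for the three views above can be sketched in Python 3. The function name quote_non_ascii and its mode argument are hypothetical, and it imitates only the output format seen in the runs above, not the real tool's full behavior.

```python
import unicodedata

def quote_non_ascii(text, mode='x'):
    """Escape every non-ASCII character in text (hypothetical helper).

    mode 'x': \\x{HEX} code point notation
    mode 'v': \\N{NAME} symbolic character names
    mode 'b': the character's UTF-8 bytes as \\xHH
    """
    pieces = []
    for ch in text:
        if ord(ch) < 0x80:
            pieces.append(ch)                      # plain ASCII passes through
        elif mode == 'x':
            pieces.append('\\x{%X}' % ord(ch))
        elif mode == 'v':
            # Fall back to U+XXXX for characters that have no name.
            name = unicodedata.name(ch, 'U+%04X' % ord(ch))
            pieces.append('\\N{%s}' % name)
        elif mode == 'b':
            pieces.append(''.join('\\x%02X' % b for b in ch.encode('utf-8')))
        else:
            raise ValueError('unknown mode: %r' % mode)
    return ''.join(pieces)

line = '2a: el nin\u0303o'
print(quote_non_ascii(line, 'x'))  # 2a: el nin\x{303}o
print(quote_non_ascii(line, 'v'))  # 2a: el nin\N{COMBINING TILDE}o
print(quote_non_ascii(line, 'b'))  # 2a: el nin\xCC\x83o
```

Only the 'b' mode looks at encoded bytes; the other two work purely on code points.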

The Moral of the Story

Even when you use UTF-8 source, you should think in terms of, and use, only logical Unicode code point numbers (or symbolic named characters), not the individual 8-bit code units that underlie the serial representation of UTF-8 (or, for that matter, the 16-bit code units of UTF-16). It is extremely rare to need code units instead of code points, and thinking in code units just confuses you.
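
The difference between code points and code units can be counted directly. This is a minimal sketch assuming Python 3.3 or later, where len() on a string counts code points; the helper names utf8_units and utf16_units are made up for illustration.

```python
def utf8_units(text):
    """Number of 8-bit code units (bytes) in the UTF-8 encoding."""
    return len(text.encode('utf-8'))

def utf16_units(text):
    """Number of 16-bit code units in the UTF-16 encoding (no BOM)."""
    return len(text.encode('utf-16-le')) // 2

samples = ('n', '\N{LATIN SMALL LETTER N WITH TILDE}', '\U0001F600')
for ch in samples:
    # Columns: code point, code points, UTF-8 units, UTF-16 units
    print('U+%04X' % ord(ch), len(ch), utf8_units(ch), utf16_units(ch))
```

Each sample is one code point, but the number of code units depends on the encoding: ñ takes two UTF-8 units, and U+1F600 takes four UTF-8 units or two UTF-16 units.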

You will also get more reliable behavior if you use a wide build of Python 3 than you will with the alternatives, but that is a UTF-32 matter, not a UTF-8 one. Both UTF-32 and UTF-8 are easy to work with if you just go with the flow.
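
Whether an interpreter behaves like a wide build can be checked at run time. The sketch below assumes Python 3.3 or later, where the narrow/wide distinction no longer exists and every build handles the full code point range; the variable names are illustrative.

```python
import sys

# A wide build (and any Python 3.3+) reports the top of the Unicode range.
# Older narrow builds reported 0xFFFF instead.
is_wide = sys.maxunicode == 0x10FFFF

# A code point beyond U+FFFF counts as one character on a wide build.
astral = '\U0001F600'
print(is_wide, len(astral))
```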
