Python default string encoding


Question

When, where, and how does Python implicitly apply encodings to strings or perform implicit transcoding (conversion)?

And what are those "default" (i.e. implied) encodings?

For example, what are the encodings:

  • of string literals?

    s = "Byte string with national characters"
    us = u"Unicode string with national characters"
    

  • of byte strings when type-converted to and from Unicode?

    data = unicode(random_byte_string)
    

  • when byte and Unicode strings are written to or read from a file or a terminal?

    print(open("The full text of War and Peace.txt").read())
    

Solution

There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.

Short answer:

  • For the purpose of code parsing:
    • str(Py2) -- not applicable, raw bytes from the file are taken
    • unicode(Py2)/str(Py3) -- "source encoding", defaults are ascii(Py2) and utf-8(Py3)
    • bytes(Py3) -- none, non-ascii characters are prohibited in the literal
  • For the purpose of transcoding:
    • both(Py2) -- sys.getdefaultencoding() (ascii almost always)
      • there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
    • both(Py3) -- none, must specify encoding explicitly when converting
  • For the purpose of I/O:
    • unicode(Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
    • str(Py2) -- not applicable, raw bytes are written
    • str(Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
    • bytes(Py3) -- none, printing produces its repr() instead
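
The values named in this summary can be inspected on a running interpreter. The following is a minimal Python 3 sketch (the variable names are illustrative, not from the original answer). The results depend on the interpreter version and the environment. Note that on Python 3 sys.getdefaultencoding() reports utf-8; this is the fallback used by encode() and decode() when called without an argument, not something applied in implicit conversions.

```python
import locale
import sys

# Fallback codec for explicit encode()/decode() calls without an argument.
transcoding_default = sys.getdefaultencoding()

# Encoding attached to the standard output stream (may be None if replaced).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Locale-derived encoding referred to in the I/O part of the summary.
file_default = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():", transcoding_default)
print("sys.stdout.encoding:", stdout_encoding)
print("locale.getpreferredencoding(False):", file_default)
```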

First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
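
To make the terminology concrete, here is a minimal Python 3 sketch (the variable names are illustrative). Encoding turns characters into bytes, decoding turns bytes back into characters, and the bytes you get depend on which codec is named.

```python
text = "абвгд"                      # characters (str in Python 3)

encoded = text.encode("utf-8")      # encoding: characters -> bytes
decoded = encoded.decode("utf-8")   # decoding: bytes -> characters

print(repr(encoded))
print(decoded == text)

# The same characters give different bytes under a different codec.
cp1251_bytes = text.encode("cp1251")
print(repr(cp1251_bytes))
```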

Now...

Reading the source and parsing string literals

At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
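
One way to see which source encoding would be picked up is the standard library's tokenize.detect_encoding(), which looks for a BOM and an encoding declaration. This is a sketch for Python 3 only (so the fallback it reports is utf-8, not ascii); the helper name source_encoding is made up for illustration.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding

declared = source_encoding(b"#encoding: cp1251\ns = 'x'\n")
undeclared = source_encoding(b"s = 'x'\n")
with_bom = source_encoding(b"\xef\xbb\xbfs = 'x'\n")

print(declared, undeclared, with_bom)
```

For the BOM case the function reports utf-8-sig, which is the utf-8 codec variant that also strips the BOM.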

Python 2

Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood but this is the net effect.)

> type t.py
#encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'

<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'

<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'    

So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
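
The byte values in the session above can be reproduced with explicit calls. This is a Python 3 sketch that mimics, step by step, what the Python 2 parser did with the file's bytes; it is not Python 2 code.

```python
text = "абвгд"

# Bytes stored in the file when it is saved as cp1251.
file_bytes = text.encode("cp1251")
print(repr(file_bytes))

# Declaring cp866 without changing the bytes decodes them as other letters.
misread = file_bytes.decode("cp866")
print(ascii(misread))

# Saving the file as UTF-8 changes the bytes but not the decoded text.
utf8_bytes = text.encode("utf-8")
print(repr(utf8_bytes), utf8_bytes.decode("utf-8") == text)
```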

If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ascii character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that implies.

Python 3

Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ascii characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
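
These Python 3 rules can be checked by compiling small snippets from strings. The helper compiles() below is a made-up name for this sketch.

```python
def compiles(source):
    """Return True if Python 3 accepts the source text, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# Non-ASCII characters written directly inside a bytes literal are rejected.
bytes_literal_ok = compiles("data = b'абв'")

# The same bytes written as escape sequences are accepted.
escaped_ok = compiles(r"data = b'\xd0\xb0\xd0\xb1\xd0\xb2'")

# Identifiers may contain non-ASCII letters.
namespace = {}
exec("значение = 42", namespace)

print(bytes_literal_ok, escaped_ok, namespace["значение"])
```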

Transcoding

As per the clarification at the start:

  • str(Py2)/bytes(Py3) -- bytes => can only be decoded (directly, that is; details follow)
  • unicode(Py2)/str(Py3) -- characters => can only be encoded

Python 2

In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".

Now, here's a caveat:

  • decode() and encode() -- with the default encoding -- are called implicitly when converting str<->unicode:

    • in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on SO are about this)
    • when trying to encode() a str or decode() a unicode (the 2nd third of the SO questions)
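
Python 2 itself cannot be run under a Python 3 interpreter, so the following is only an emulation in Python 3 of the two implicit conversions listed above. The helper names py2_concat and py2_str_encode are invented for this sketch; they spell out the hidden decode step with the ascii default.

```python
PY2_DEFAULT = "ascii"   # what sys.getdefaultencoding() returned in Python 2

def py2_concat(unicode_part, byte_part):
    """Emulate Python 2's u'...' + '...': the byte string is decoded implicitly."""
    return unicode_part + byte_part.decode(PY2_DEFAULT)

def py2_str_encode(byte_string, encoding):
    """Emulate Python 2's str.encode(): an implicit decode happens first."""
    return byte_string.decode(PY2_DEFAULT).encode(encoding)

print(py2_concat("abc", b"def"))            # pure ASCII works

try:
    py2_concat("abc", b"\xd0\xb0")          # non-ASCII bytes fail
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)

try:
    py2_str_encode(b"\xd0\xb0", "utf-8")    # encode() fails with a decode error
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)
```

The second failure shows why these errors were confusing: a call to encode() reports a decoding error, because the implicit decode step runs first.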

Python 3

There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.

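  • bytes can only be decoded and str can only be encoded. The bytes(s, encoding) and str(b, encoding) constructors need the encoding argument to perform a conversion, while the encode() and decode() methods fall back to utf-8 when it is omitted.
  • converting bytes->str with str(b) and no encoding (incl. implicitly, as in print) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
  • converting str->bytes implicitly is prohibited: bytes(s) without an encoding raises a TypeError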
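
A short Python 3 sketch of these rules follows. The helper error_name is a made-up convenience that reports which exception, if any, an operation raises.

```python
def error_name(action):
    """Run a zero-argument callable and return the exception class name, or None."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None

data = "абв".encode("utf-8")

mixing = error_name(lambda: "abc" + data)       # str + bytes
no_encoding = error_name(lambda: bytes("абв"))  # bytes(str) without an encoding

as_repr = str(data)             # the repr of the bytes, not decoded text
as_text = str(data, "utf-8")    # explicit decoding
round_trip = bytes("абв", "utf-8") == data

print(mixing, no_encoding, as_repr, as_text == "абв", round_trip)
```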

Printing

This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.

Python 2

  • A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into this category.)
    • For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING envvar.
  • str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
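
The mechanism of encoding text with the stream's encoding can be sketched with in-memory streams. This is Python 3 code that illustrates the same idea; it does not reproduce Python 2's print. The helper print_to is a made-up name.

```python
import io

def print_to(encoding, text):
    """Print text to an in-memory stream with the given encoding; return the bytes."""
    raw = io.BytesIO()
    # newline="\n" disables platform-specific newline translation.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()

# The stream's encoding decides which bytes reach the OS.
cp866_output = print_to("cp866", "абв")
print(repr(cp866_output))

# A stream encoding that cannot represent the characters raises an error.
try:
    print_to("ascii", "абв")
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError with codec", exc.encoding)
```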

Python 3

The changes are:

  • Files opened in text mode now accept only str and files opened in binary mode accept only bytes; each outright refuses to process the wrong type. Text-mode files always have an encoding set, locale.getpreferredencoding(False) being the default.
  • print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether
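
Both points can be tried without touching the file system. In this Python 3 sketch, io.StringIO and io.BytesIO stand in for a text-mode and a binary-mode file (an assumption made to keep the example self-contained), and rejected is a made-up helper name.

```python
import io

text_stream = io.StringIO()          # stands in for a text-mode file
binary_stream = io.BytesIO()         # stands in for a binary-mode file

def rejected(stream, value):
    """Return True if writing value to stream raises TypeError."""
    try:
        stream.write(value)
    except TypeError:
        return True
    return False

bytes_to_text = rejected(text_stream, b"bytes")
text_to_binary = rejected(binary_stream, "text")
print(bytes_to_text, text_to_binary)

# print() converts bytes with str(), so the repr is what gets written.
print(b"\xe0\xe1", file=text_stream)
printed = text_stream.getvalue()
print(repr(printed))
```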
