Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?


Question

I spent a few angry hours tracking down a problem with Unicode strings that came down to something Python (2.7) hides from me, and I still don't understand it. First, I tried to use u".." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either, and it all works out automagically. However, while banging my head against the wall, I (with credit to a friend who helped me) did notice something weird: sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. In the code below, case 1 works fine without any modifications to sys, and case 2 raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2 works fine. My question is why the two encoding variables are different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've obviously read that, along with the tens of questions about UnicodeEncodeError.

#  -*- coding: utf-8 -*-
import sys


class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value

print(sys.getdefaultencoding())
print(sys.stdout.encoding)

# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)

reload(sys).setdefaultencoding("utf8")

# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)

Recommended answer

My question is why the two encoding variables are different in the first place

They serve different purposes.

sys.stdout.encoding should be the encoding that your terminal uses to interpret text; otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, and so on.
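
As a rough illustration of the mojibake mentioned above, the sketch below encodes the question's string as UTF-8 and then decodes the same bytes the way a UTF-8 terminal and a cp437 terminal would. The variable names are made up for this example, and the terminals are only simulated with explicit decode() calls.

```python
# Sketch: the same bytes interpreted by two differently configured terminals.
text = u"I need 20 000\u20ac."         # Unicode text; \u20ac is the euro sign

utf8_bytes = text.encode("utf-8")      # bytes a program might write to stdout

# A terminal that expects UTF-8 recovers the original text.
seen_on_utf8_terminal = utf8_bytes.decode("utf-8")

# A terminal that expects cp437 misreads the three euro bytes as three
# unrelated characters: this is mojibake.
seen_on_cp437_terminal = utf8_bytes.decode("cp437")

print(repr(utf8_bytes))
print(seen_on_utf8_terminal == text)
print(seen_on_cp437_terminal == text)
```

The cp437 result has two characters more than the original, because the euro sign occupies three bytes in UTF-8 and cp437 maps every byte to exactly one character.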

sys.getdefaultencoding() is used on Python 2 for implicit conversions (when the encoding is not set explicitly). That is, Python 2 lets you mix ascii-only bytestrings and Unicode strings together: for example, xml.etree.ElementTree stores text in the ascii range as bytestrings, and json.dumps() returns an ascii-only bytestring instead of Unicode in Python 2, perhaps for performance reasons, since bytes were cheaper than Unicode for representing ascii characters. Implicit conversions are forbidden in Python 3.
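
The sketch below spells out, with explicit calls, the conversions that Python 2 performs implicitly with the default 'ascii' codec. It is probably also what happens in case 2 of the question, where print(tok) has to turn the Unicode value returned by __str__ into a byte string. The helper names try_ascii_encode and try_ascii_decode are invented for this illustration.

```python
# Sketch: explicit versions of Python 2's implicit ascii conversions.
text = u"I need 20 000\u20ac."        # Unicode text with a non-ascii character
data = text.encode("utf-8")           # the same text as UTF-8 bytes


def try_ascii_encode(unicode_text):
    """Unicode -> bytes with the ascii codec; return the error if it fails."""
    try:
        return unicode_text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc


def try_ascii_decode(byte_string):
    """Bytes -> Unicode with the ascii codec; return the error if it fails."""
    try:
        return byte_string.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc


encode_result = try_ascii_encode(text)                 # fails: euro is not ascii
decode_result = try_ascii_decode(data)                 # fails: bytes above 0x7f
ascii_only = try_ascii_decode(b"I need 20 000 EUR.")   # works: pure ascii

print(type(encode_result).__name__)
print(type(decode_result).__name__)
print(ascii_only)
```

As long as the data is ascii-only, the implicit conversion goes unnoticed, which is why such bugs tend to surface only once real non-ascii input arrives.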

sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do: overriding it may hide bugs, and your data may easily be corrupted by implicit conversions that use a possibly wrong encoding for the data.

By the way, there is another common encoding, sys.getfilesystemencoding(), that may be different from the two. sys.getfilesystemencoding() should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
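
To see how these three values differ on a particular machine, a small report like the one below can help. The function name report_encodings is an assumption made for this sketch; the sys calls themselves are the ones discussed above. The stdout encoding may be None in Python 2 when output is redirected, so it is read defensively.

```python
import sys


def report_encodings():
    """Collect the three encodings discussed above into a dictionary."""
    return {
        "default": sys.getdefaultencoding(),
        # sys.stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%-10s %s" % (name, value))
```

Running it once in a terminal and once with the output piped to a file is a quick way to see which values depend on the environment.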

The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.

Naturally, if you read data from a file or from the network, it may use a character encoding different from all of the above. For example, if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can be different from it.

The point being: there could be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
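
A minimal sketch of the Unicode sandwich follows, assuming the input arrives as cp1252 bytes and the output has to be UTF-8. The function names read_text, process, and write_bytes are hypothetical and only mark the three layers; a real program would read from and write to files or sockets instead.

```python
def read_text(raw_bytes, input_encoding):
    """Bottom slice: decode bytes to Unicode as soon as they arrive."""
    return raw_bytes.decode(input_encoding)


def process(text):
    """Filling: work on Unicode text only, with no encodings involved."""
    return text.upper()


def write_bytes(text, output_encoding):
    """Top slice: encode to bytes as late as possible, just before output."""
    return text.encode(output_encoding)


raw = b"I need 20 000\x80."           # cp1252 bytes; 0x80 is the euro sign
text = read_text(raw, "cp1252")       # now Unicode text
result = process(text)                # still Unicode text
encoded = write_bytes(result, "utf-8")

print(repr(encoded))
```

The input and output encodings differ here on purpose: because the middle layer only ever sees Unicode, it does not need to know about either of them.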

how do I manage to use the wrong encoding in this simple piece of code?

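  1. Your first code example is not good. You are using a non-ascii literal character in a byte string on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings, if necessary). If you run that code with Python 2 in any environment that does not use a utf-8-compatible encoding (such as the Windows console), it may produce mojibake, e.g. I need 20 000Γé¼. (note the character noise)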

  2. The second code example is OK, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals
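
To make the difference between the two literals concrete: in Python 2 a byte string literal holds whatever bytes the source file uses for the euro sign, so its content depends on the file's encoding, while a Unicode literal denotes the same text either way. The sketch below simulates two source encodings with explicit encode() calls; the variable names are made up for this example.

```python
# Sketch: one Unicode string, two possible byte representations.
text = u"I need 20 000\u20ac."        # what the u"..." literal denotes

as_utf8 = text.encode("utf-8")        # bytes in a UTF-8 encoded source file
as_cp1252 = text.encode("cp1252")     # bytes in a cp1252 encoded source file

print(len(text))                      # number of characters
print(len(as_utf8))                   # euro sign takes three bytes in UTF-8
print(len(as_cp1252))                 # euro sign takes one byte in cp1252
print(as_utf8 == as_cp1252)
```

Printing either byte string unchanged only looks right on a terminal that happens to use the same encoding as the source file, which is the situation item 1 warns about.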

Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), or to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
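
One way to check the effect of PYTHONIOENCODING is to start a child interpreter whose stdout is a pipe and ask it what sys.stdout.encoding is. This is only a sketch: the helper name child_stdout_encoding is invented here, and it assumes that sys.executable points at a usable Python interpreter.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set and piped stdout,
    and return the sys.stdout.encoding value that the child reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return output.decode("ascii").strip()


print(child_stdout_encoding("utf-8"))
```

The same variable can be set in the shell before launching the script, which leaves the code itself free of any encoding workarounds.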
