Python 2 assumes different source code encodings
Question
I noticed that, without a source code encoding declaration, the Python 2 interpreter assumes the source code is encoded in ASCII when it comes from a script or from standard input:
$ python test.py # where test.py holds the line: print u'é'
File "test.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
$ echo "print u'é'" | python
File "/dev/fd/63", line 1
SyntaxError: Non-ASCII character '\xc3' in file /dev/fd/63 on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
and that it is encoded in ISO-8859-1 with the -m module and -c command flags:
$ python -m test # where test.py holds the line: print u'é'
Ã©
$ python -c "print u'é'"
Ã©
Where is this documented?
Contrast this with Python 3, which always assumes the source code is encoded in UTF-8 and thus prints é in all four cases.
Note: I tested this with CPython 2.7.14 on macOS 10.13 and on Ubuntu Linux 17.10, with the console encoding set to UTF-8.
Recommended answer
The -c and -m switches ultimately(*) run the supplied code with the exec statement or the compile() function, both of which take Latin-1 source code:
The first expression should evaluate to either a Unicode string, a Latin-1 encoded string, an open file object, a code object, or a tuple.
This is not documented; it's an implementation detail that may or may not be considered a bug.
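To make the effect of the assumed encoding concrete, here is a minimal sketch (not taken from the interpreter source) that decodes the two bytes a UTF-8 console sends for é under each of the three encodings discussed here. The variable names are illustrative only.

```python
# The two bytes a UTF-8 console produces for the character é (U+00E9).
utf8_bytes = u"\xe9".encode("utf-8")

# Decoded as UTF-8, the bytes give back the single intended character.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded as Latin-1, every byte becomes its own character (U+00C3, U+00A9),
# which a UTF-8 console then displays as two characters instead of one.
as_latin1 = utf8_bytes.decode("latin-1")

# Decoded as ASCII, the first byte (0xC3) is already out of range.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(utf8_bytes), len(as_utf8), len(as_latin1), ascii_ok)
```

The ASCII failure corresponds to the SyntaxError seen for scripts and standard input, while the Latin-1 result corresponds to what -c and -m produce in Python 2.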
I don't think it is worth fixing, however, and Latin-1 is a superset of ASCII, so little is lost. The handling of code from -c and -m has been cleaned up in Python 3 and is much more consistent there: code passed in with -c is decoded using the current locale, and modules loaded with the -m switch default to UTF-8, as usual.
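The Python 3 default for module source can be observed without any files by handing compile() a byte string. This is a sketch that assumes Python 3; the file-name labels and dictionary names are made up for the example. The second case adds a PEP 263 declaration to show that an explicit declaration overrides the UTF-8 default.

```python
# Assumes Python 3: compile() decodes byte-string source as UTF-8
# unless a PEP 263 encoding declaration (or a BOM) says otherwise.
utf8_source = b"text = '\xc3\xa9'\n"                          # no declaration
latin1_source = b"# -*- coding: latin-1 -*-\n" + utf8_source  # explicit declaration

default_ns, declared_ns = {}, {}
exec(compile(utf8_source, "<utf8-default>", "exec"), default_ns)
exec(compile(latin1_source, "<latin1-declared>", "exec"), declared_ns)

# The same bytes yield one character by default, two under Latin-1.
print(len(default_ns["text"]), len(declared_ns["text"]))
```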
(*) If you want to know the exact implementations used, start at the Py_Main() function in Modules/main.c, which handles both -c and -m as:
if (command) {
    sts = PyRun_SimpleStringFlags(command, &cf) != 0;
    free(command);
} else if (module) {
    sts = RunModule(module, 1);
    free(module);
}
- -c is executed through the PyRun_SimpleStringFlags() function, which in turn calls PyRun_StringFlags(). When you use exec, a bytestring object is passed to PyRun_StringFlags() too, and the source code is then assumed to contain Latin-1-encoded bytes.
- -m uses the RunModule() function to pass the module name to the private function _run_module_as_main() in the runpy module, which uses pkgutil.get_loader() to load the module metadata, and fetches the module code object with the loader.get_code() function on the PEP 302 loader; if no cached bytecode is available, the code object is produced by using the compile() function with the mode set to exec.
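The final step of the -m path, compiling source in exec mode and running the resulting code object, can be sketched as follows. This only illustrates the compile() and exec pair; it is not the runpy implementation, and the source string, file-name label, and namespace are hypothetical.

```python
# Hypothetical module source; on the real -m path it is read from the module's file.
source = "answer = 6 * 7\n"

# "exec" mode compiles a whole sequence of statements, as needed for a module.
code = compile(source, "<example-module>", "exec")

# Run the code object in a fresh namespace, standing in for the module's globals.
namespace = {"__name__": "__main__"}
exec(code, namespace)

print(namespace["answer"])
```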