Python, Windows console and encodings (cp850 vs cp1252)


Problem description


I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850 - and Python reports it correctly - parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.

Am I missing something, or is this a bug in Python? In Windows? Note: I am running Python 2.6.6 on Windows 7 EN, with the locale set to French (Switzerland).

In the test program below, I check that literals are correctly interpreted and can be printed - this works. But all values I pass on the command line seem to be encoded wrongly:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys

literal_mb = 'utf-8 literal:   üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')

print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print "  ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""

In a newly created console, when running

C:\dev>test-encoding.py abcé€

I get the following output

Testing literals
utf-8 literal:   üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
   a -> 97
   b -> 98
   c -> 99
   Ú -> 233 <- é [assuming input was actually cp1252 ]
   Ç -> 128 <- ? [assuming input was actually cp1252 ]

while I would expect the 4th character to have an ordinal value of 130 instead of 233 (see the code pages 850 and 1252).

Notes: the value of 128 for the euro symbol is a mystery, since cp850 does not have that character. Otherwise, the '?' characters are expected: cp850 cannot print those characters, and I used 'replace' in the conversions.
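The byte values in this output can be checked directly against the two code pages. The following is a small sketch, written for Python 3 (an assumption; the question uses Python 2.6), that runs the same characters through the standard cp850 and cp1252 codecs. It suggests one reading consistent with the output: the argument bytes are cp1252, and the console merely displays them as cp850.

```python
# Sketch for Python 3 (assumption: the original question ran Python 2.6).
# Compare how the characters of the test argument map to bytes in cp850 and cp1252.

arg = "abc\u00e9\u20ac"  # 'abc' followed by e-acute and the euro sign

# e-acute has a different byte value in each code page.
e_acute_cp850 = "\u00e9".encode("cp850")[0]    # value expected from a cp850 console
e_acute_cp1252 = "\u00e9".encode("cp1252")[0]  # value actually seen in sys.argv

# The euro sign exists in cp1252 but not in cp850.
euro_cp1252 = "\u20ac".encode("cp1252")[0]
euro_in_cp850 = "\u20ac".encode("cp850", "replace")  # unencodable, so replaced

# Encode the argument as cp1252, then misread those bytes as cp850.
raw = arg.encode("cp1252")
misread = raw.decode("cp850")

print(e_acute_cp850, e_acute_cp1252, euro_cp1252, euro_in_cp850)
# True if the misreading yields 'abc' + U-acute + C-cedilla, as shown in the console
print(misread == "abc\u00da\u00c7")
```

Under this reading, 128 is simply the cp1252 byte for the euro sign, which a cp850 console displays as 'Ç'.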

If I change the code page of the console to 1252 by issuing chcp 1252 and run the same command, I (correctly) obtain

Testing literals
utf-8 literal:   üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
   a -> 97
   b -> 98
   c -> 99
   é -> 233
   € -> 128

Any ideas what I'm missing?

Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem really affects the command line only. Is the command line treated differently from standard input?

Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: Read Unicode characters from command-line arguments in Python 2.x on Windows. Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.

Solution

Replying to myself:

On Windows, the encoding used by the console (thus, that of sys.stdin/out) differs from the encoding of various OS-provided strings - obtained through e.g. os.getenv(), sys.argv, and certainly many more.

The encoding provided by sys.getdefaultencoding() is really just that: a default, chosen by the Python developers to match the "most reasonable encoding" the interpreter uses in extreme cases. I get 'ascii' on my Python 2.6, and tried with portable Python 3.1, which yields 'utf-8'. Neither is what we are looking for; they are merely fallbacks for encoding conversion functions.
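To see these values side by side on a given machine, a minimal sketch along the following lines may help. It is written for Python 3 (an assumption; the question uses Python 2.6), and report_encodings is a hypothetical helper name, not something from the original post.

```python
# Sketch for Python 3 (assumption). report_encodings is a hypothetical helper.
import sys


def report_encodings():
    """Collect the console stream encodings and the interpreter default."""
    return {
        # Console stream encodings; None if a stream is missing or has no encoding.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # Interpreter-wide fallback, unrelated to the console or the OS strings.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print(name, "->", value)
```

None of the three values describes how the OS encoded the command line, which is the point made above.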

As this page seems to state, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python does not have a native function to retrieve it, I had to use ctypes:

from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())

Edit: But as Jacek suggests, there actually is a more robust and Pythonic way to do it (semantics would need validation, but until proven wrong, I'll use this)

import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!

and then

u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)
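For readers on Python 3, where sys.argv and os.getenv() already return text strings, the same idea only matters for raw bytes received from the OS. The sketch below assumes Python 3; decode_os_bytes is a hypothetical helper name, and the sample bytes are the cp1252 form of the argument from the question.

```python
# Sketch for Python 3 (assumption). decode_os_bytes is a hypothetical helper.
import locale


def decode_os_bytes(raw, encoding=None):
    """Decode bytes from the OS, defaulting to the locale's preferred encoding."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # 'replace' avoids an exception if the guessed encoding is wrong.
    return raw.decode(encoding, "replace")


# The argument from the question as cp1252 bytes: 'abc', e-acute, euro sign.
raw_arg = b"abc\xe9\x80"

# With an explicit encoding the result is deterministic.
print(decode_os_bytes(raw_arg, "cp1252") == "abc\u00e9\u20ac")

# Without one, the result depends on the machine's locale settings.
print(ascii(decode_os_bytes(raw_arg)))
```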

On my system, os_encoding = 'cp1252', so it works. I am quite certain this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name - something better than just prepending 'cp'.
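One possible shape for such a translation is sketched below. It assumes Python 3, and acp_to_python_encoding is a hypothetical helper name. Rather than a hand-written table, it asks Python's codec registry whether a codec named 'cp' plus the number exists, and special-cases 65001, the Windows code page number for UTF-8. It is not a complete mapping: code pages with no Python codec of that name simply yield None.

```python
# Sketch for Python 3 (assumption). acp_to_python_encoding is a hypothetical helper.
import codecs


def acp_to_python_encoding(acp):
    """Map a Windows code page number to a Python codec name, or None if unknown."""
    if acp == 65001:
        # Windows uses code page 65001 for UTF-8.
        return "utf-8"
    candidate = "cp%d" % acp
    try:
        # codecs.lookup raises LookupError when no such codec is registered.
        return codecs.lookup(candidate).name
    except LookupError:
        return None


for code_page in (850, 1252, 65001, 99999):
    print(code_page, "->", acp_to_python_encoding(code_page))
```

Callers would still need to decide what to do when the helper returns None, for example falling back to locale.getpreferredencoding().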

This is unfortunately a hack, although I find it a bit less intrusive than the one suggested by this ActiveState Code Recipe (linked to by the SO question mentioned in Edit 2 of my question). The advantage I see here is that this can be applied to os.getenv(), and not only to sys.argv.
