chcp 65001 codepage results in program termination without any error


Problem
The problem arises when I want to input a Unicode character in the Python interpreter (for simplicity I have used a-umlaut in the example, but I first encountered this with Farsi characters). Whenever I use Python with the chcp 65001 code page and then try to input even one Unicode character, Python exits without any error.

I have spent days trying to solve this problem to no avail. But today I found a thread on the Python website, another on MySQL and another on Lua-users in which issues were raised regarding this sudden exit, although without any solution, and with some saying that chcp 65001 is inherently broken.

It would be good to know once and for all whether this problem is chcp-design-related or there is a possible workaround.

Reproduce Error

chcp 65001

Python 3.X:

Python shell

print('ä')

result: it just exits the shell

however, this works: python.exe -c "print('ä')" and so does this: print('\u00e4')

result: ä

in LuaJIT 2.0.4:

print('ä')

result: it just exits the shell

however this works: print('\xc3\xa4')

I have come up with this observation so far:

  1. Direct output from the command prompt works.
  2. The Unicode-escape or hex-escape equivalent of the character works.

So this is not a Python bug, and we can't type a Unicode character directly into CLI programs in the Windows command prompt or any of its wrappers such as ConEmu or Cmder (I am using Cmder to be able to see and use Unicode characters in the Windows shell, and I have done so without any problem). Is this correct?

Solution

To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
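
As a rough illustration of the transcoding step described above, the following sketch converts between UTF-8 bytes and UTF-16LE bytes in pure Python. It does not call the console API, and the helper names utf8_to_utf16le and utf16le_to_utf8 are made up for this example; they are not part of io._WindowsConsoleIO.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Transcode UTF-8 bytes (what the program writes) to UTF-16LE
    (what a wide-character console function would be given)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16LE bytes (what a wide-character read returns)
    back to UTF-8 bytes (what the program sees)."""
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    utf8 = "\u00e4".encode("utf-8")      # a-umlaut as 2 UTF-8 bytes
    wide = utf8_to_utf16le(utf8)         # one 16-bit code unit
    print(utf8, wide, utf16le_to_utf8(wide))
```

The point of the round trip is that the program only ever handles UTF-8, while the console only ever handles wide characters, so the active code page no longer matters.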

The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
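
The link between an empty read and EOFError can be reproduced on any platform by substituting an in-memory stream for standard input. This is only a sketch of the end-of-file handling, not of the console itself; read_line_from is a hypothetical helper written for this example.

```python
import io
import sys


def read_line_from(fake_stdin_text: str) -> str:
    """Call input() with sys.stdin temporarily replaced by an in-memory
    stream, and report whether the read was treated as end of file."""
    saved = sys.stdin
    sys.stdin = io.StringIO(fake_stdin_text)
    try:
        return input()
    except EOFError:
        # A read that returns no data at all is interpreted as end of file.
        return "<EOFError>"
    finally:
        sys.stdin = saved


if __name__ == "__main__":
    print(read_line_from("print('x')\n"))  # a normal line is returned
    print(read_line_from(""))              # an empty read raises EOFError
```

The REPL reacts to end of file on standard input by exiting, which matches the silent exit described in the question.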

The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
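
The sizing mistake can be modelled in a few lines of Python. This is a simulation under stated assumptions, not conhost's actual code: broken_read and chunked_read are hypothetical names, the destination buffer in broken_read is assumed to hold one byte per character read, and Python code points stand in for UTF-16 code units.

```python
def broken_read(text: str, nbytes: int) -> bytes:
    """Model of the described behaviour: an N-byte request is treated
    as a request for N characters."""
    chunk = text[:nbytes]               # up to N characters, not N bytes
    capacity = len(chunk)               # assumption: one byte per character
    encoded = chunk.encode("utf-8")
    if len(encoded) > capacity:         # models WideCharToMultiByte failing
        return b""                      # a 'successful' read of 0 bytes
    return encoded


def chunked_read(text: str, nbytes: int) -> bytes:
    """Encode in iterations of M // 4 characters, where M is the number
    of bytes still free in the N-byte buffer."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        remaining = nbytes - len(out)
        step = remaining // 4           # each character needs at most 4 bytes
        if step == 0:
            break                       # no room for a worst-case character
        piece = text[pos:pos + step]
        out += piece.encode("utf-8")
        pos += len(piece)
    return bytes(out)


if __name__ == "__main__":
    line = "print('\u00e4')\r\n"
    print(broken_read(line, 4096))      # empty: looks like end of file
    print(chunked_read(line, 4096))     # the full line as UTF-8
```

In this model an ASCII-only line passes through broken_read unchanged, which is consistent with the observation that escaped forms such as '\u00e4' work while a literal non-ASCII character does not.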

You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.

Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.
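
The output bug can be modelled the same way. In this sketch, buggy_console_write and write_all are hypothetical stand-ins for the console write call and a buffered writer's retry loop; the screen is just a Python list, and invalid bytes are shown as U+FFFD rather than whatever the real console would have displayed.

```python
def buggy_console_write(screen: list, data: bytes) -> int:
    """Model of the pre-Windows 8 behaviour: all of the data is written,
    but the return value counts UTF-16 code units, not UTF-8 bytes."""
    text = data.decode("utf-8", errors="replace")
    screen.append(text)
    return len(text.encode("utf-16-le")) // 2


def write_all(screen: list, data: bytes) -> None:
    """Retry loop in the style of a buffered writer: anything reported
    as unwritten is sent again."""
    while data:
        written = buggy_console_write(screen, data)
        data = data[written:]


if __name__ == "__main__":
    screen = []
    write_all(screen, "\u00e4\u00e4\n".encode("utf-8"))
    print(screen)  # the real line, followed by a garbage tail
```

Each non-ASCII character makes the reported count fall further short of the byte count, so the tail that gets re-sent grows with the number of such characters, as described above.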
