chcp 65001 codepage results in program termination without any error


Problem description

Problem
The problem arises when I enter a Unicode character in the Python interpreter (for simplicity I have used a-umlaut in the example, but I first encountered this with Farsi characters). Whenever I use Python with the chcp 65001 code page and then try to enter even one Unicode character, Python exits without any error.

I have spent days trying to solve this problem, to no avail. But today I found a thread on the Python website, another on MySQL and another on Lua-users, in which issues were raised regarding this sudden exit, although without any solution, and with some people saying that chcp 65001 is inherently broken.

It would be good to know once and for all whether this problem is chcp-design-related or there is a possible workaround.

Reproducing the error

chcp 65001

Python 3.X:

Python shell

print('ä')

result: it just exits the shell

however, this works: python.exe -c "print('ä')" and also this: print('\u00e4')

result: ä

In LuaJIT 2.0.4:

print('ä')

result: it just exits the shell

however, this works: print('\xc3\xa4')

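So far I have come up with these observations:

  1. Direct output from the command prompt works.
  2. The Unicode-escape and hex-escape equivalents of the character work.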
So this is not a Python bug, and we simply can't enter a Unicode character directly in CLI programs in the Windows command prompt or any of its wrappers such as ConEmu or Cmder (I am using Cmder to be able to see and use Unicode characters in the Windows shell, and I have done so without any problem). Is this correct?

Recommended answer

To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
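
As a rough illustration of the transcoding step described above, here is a minimal sketch of converting between the UTF-8 bytes a program sees and the UTF-16LE data that the wide-character API works with. This is not CPython's actual implementation; the function names are made up for this example.

```python
# Hypothetical helpers: a model of the UTF-8 <-> UTF-16LE transcoding idea,
# not the code used by io._WindowsConsoleIO.

def utf8_to_utf16le(data: bytes) -> bytes:
    """Convert UTF-8 bytes (what the program writes) to UTF-16LE."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16LE (what the wide-character API returns) to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    utf8 = "ä".encode("utf-8")          # two bytes: C3 A4
    wide = utf8_to_utf16le(utf8)         # one UTF-16 code unit: E4 00
    print(utf8.hex(), wide.hex())
    print(utf16le_to_utf8(wide) == utf8)
```

The point is only that 'ä' is two bytes in UTF-8 but a single 16-bit code unit in UTF-16, so byte counts and character counts differ between the two sides.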

The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
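
For a script (as opposed to the REPL), the empty read shows up as an EOFError from input. A minimal sketch of catching it is below; `read_line` is a hypothetical helper, and note that it cannot distinguish an empty read caused by this console behavior from a genuine end of input.

```python
def read_line(prompt=""):
    """Return one line from standard input, or None if the read came back empty.

    An empty read (including the 0-byte 'successful' read described above)
    makes input() raise EOFError, which is reported here as None.
    """
    try:
        return input(prompt)
    except EOFError:
        return None


if __name__ == "__main__":
    line = read_line("> ")
    if line is None:
        print("empty read (EOF or console returned 0 bytes)")
    else:
        print("got:", line)
```

This only keeps the program from dying silently; it does not recover the text that was typed.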

The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
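
The "multiple iterations of M / 4 characters" idea can be sketched in Python as follows. This is a model of the approach, not conhost's code: `fill_utf8_buffer` is a made-up name, and Python strings index code points rather than UTF-16 code units, so the 4-bytes-per-character bound is simply the UTF-8 worst case.

```python
def fill_utf8_buffer(text, n):
    """Encode as much of `text` as safely fits into an n-byte buffer.

    Each pass takes at most M // 4 characters, where M is the space still
    free, so even input made entirely of 4-byte characters cannot overflow.
    Returns (encoded_bytes, number_of_characters_consumed).
    """
    out = bytearray()
    pos = 0
    while pos < len(text):
        remaining = n - len(out)      # M: bytes still available
        batch = remaining // 4        # worst case is 4 bytes per character
        if batch == 0:
            break                     # fewer than 4 bytes left, stop here
        chunk = text[pos:pos + batch]
        out += chunk.encode("utf-8")
        pos += len(chunk)
    return bytes(out), pos


if __name__ == "__main__":
    data, used = fill_utf8_buffer("print('ä')", 64)
    print(len(data), used)            # byte count exceeds character count
```

The strategy is conservative: it may leave up to three bytes of the buffer unused, but it never has to fail the conversion, and the caller can pick up the remaining characters on the next read.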

You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.

Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.
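
To see why a wrong return value produces a trail of garbage, here is a toy simulation, not the real console or CPython's buffered writer. `buggy_console_write` and `write_all` are hypothetical names; the fake write reports the number of UTF-16 code units instead of the number of bytes, and the writer loop retries whatever it believes is still unwritten.

```python
def buggy_console_write(data, screen):
    """Toy stand-in for a pre-Windows 8 console write under codepage 65001."""
    text = data.decode("utf-8", errors="replace")
    screen.append(text)                       # everything was actually shown
    return len(text.encode("utf-16-le")) // 2  # bug: code units, not bytes


def write_all(data, raw_write, screen):
    """Keep writing until the reported byte count covers all of `data`."""
    while data:
        written = raw_write(data, screen)
        data = data[written:]                 # the supposedly unwritten rest


if __name__ == "__main__":
    screen = []
    write_all("ää\n".encode("utf-8"), buggy_console_write, screen)
    print(screen)   # the second entry is the spurious repeated write
```

With ASCII-only text the two counts agree and a single write happens; each non-ASCII character widens the gap and so lengthens the repeated tail.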
