Why don't scripting languages output Unicode to the Windows console?


Question

The Windows console has been Unicode-aware for at least a decade, perhaps as far back as Windows NT. Yet for some reason the major cross-platform scripting languages, including Perl and Python, only ever output various 8-bit encodings, and working around that takes a lot of trouble. Perl gives a "wide character in print" warning; Python gives a charmap error and quits. Why on earth, after all these years, do they not simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?

Is it just that cross-platform performance is low priority? Is it that the languages use UTF-8 internally and find it too much bother to output UTF-16? Or are the -W APIs inherently broken to such a degree that they can't be used as-is?
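To make the "codepage bottleneck" concrete, here is a minimal Python sketch. The sample string and the choice of code page 437 are my own assumptions for illustration, not something taken from the question. It shows the same text failing to encode into an 8-bit code page while encoding cleanly into the UTF-16 code units that the -W APIs accept.

```python
# Illustrative only: the sample text and the code page are assumptions.
text = "status: ✓"  # U+2713 CHECK MARK is not part of code page 437


def encode_or_error(s, codec):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


narrow = encode_or_error(text, "cp437")    # a legacy 8-bit OEM code page
wide = encode_or_error(text, "utf-16-le")  # the form the -W APIs consume

print(type(narrow).__name__)               # the 8-bit path raised an error
print(len(wide) // 2, "UTF-16 code units")
```

A script whose stdout is bound to such a code page hits the first case as soon as it prints a character outside that code page.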

Update

It seems that the blame may need to be shared by all parties. I imagined that the scripting languages could just call wprintf on Windows and let the OS/runtime worry about things such as redirection. But it turns out that even wprintf on Windows converts wide characters to ANSI and back before printing to the console!

Please let me know if this has been fixed. The bug report link seems to be broken, but my Visual C test code still fails for wprintf and succeeds for WriteConsoleW.
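For readers who want to try the WriteConsoleW route from a scripting language rather than from C, here is a rough sketch using Python's ctypes. The helper names write_console_w and utf16_length are hypothetical; the Win32 calls (GetStdHandle, GetConsoleMode, WriteConsoleW) are the real kernel32 functions. It assumes stdout is the target, ignores partial writes, and simply returns False when stdout is not a console so the caller can fall back to byte output.

```python
import sys


def utf16_length(text):
    """Number of UTF-16 code units, which is the unit WriteConsoleW counts in."""
    return len(text.encode("utf-16-le")) // 2


def write_console_w(text):
    """Try to print text with WriteConsoleW; return True on success.

    Returns False when not running on Windows or when stdout is not a
    console (for example, when output is redirected to a file or pipe).
    """
    if sys.platform != "win32":
        return False
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.LPDWORD]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        wintypes.LPDWORD, wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    mode = wintypes.DWORD()
    # GetConsoleMode fails for files and pipes, so it doubles as a console test.
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False

    sys.stdout.flush()  # keep ordering sane relative to buffered output
    written = wintypes.DWORD()
    ok = kernel32.WriteConsoleW(handle, text, utf16_length(text),
                                ctypes.byref(written), None)
    return bool(ok)
```

A caller would try write_console_w(text) first and, if it returns False, write encoded bytes to the redirected stream instead. That split between console and non-console output is exactly the redirection concern mentioned above.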

Update 2

Actually, you can print UTF-16 to the console from C using wprintf, but only if you first call _setmode(_fileno(stdout), _O_U16TEXT).

From C you can print UTF-8 to a console whose code page is set to 65001. However, Perl, Python, PHP and Ruby all have bugs that prevent this. Perl and PHP corrupt the output by adding extra blank lines after any line that contains at least one wide character. Ruby produces slightly different corrupt output. Python crashes.

Update 3

Node.js is the first scripting language that shipped without this problem straight out of the box.

The Python dev team slowly came to realize this was a real problem after it was first reported back at the end of 2007, and 2016 saw a huge flurry of activity to fully understand and fully fix the bug.

Recommended answer

The main problem seems to be that it is not possible to use Unicode on Windows using only the standard C library, with no platform-dependent or third-party extensions. The languages you mentioned originate from Unix platforms, whose method of implementing Unicode blends well with C (they use normal char* strings, the C locale functions, and UTF-8). If you want to do Unicode in C, you more or less have to write everything twice: once using nonstandard Microsoft extensions, and once using the standard C API functions for all other operating systems. While this can be done, it usually doesn't have high priority because it's cumbersome and most scripting language developers either hate or ignore Windows anyway.

At a more technical level, I think the basic assumption that most standard library designers make is that all I/O streams are inherently byte-based at the OS level. That is true for files on all operating systems and for all streams on Unix-like systems, with the Windows console being the only exception. Thus the architecture of many class libraries and programming language standards has to be modified to a great extent if one wants to incorporate Windows console I/O.
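To illustrate the architectural point, here is a small, purely hypothetical sketch (the class names ByteStreamSink and FakeConsoleSink are mine, and the console is simulated so the code runs anywhere). A byte-based sink encodes text into bytes with some codec, while a console-style sink wants UTF-16 code units, so a library that assumes "every stream is bytes" needs a second code path.

```python
import io


class ByteStreamSink:
    """File, pipe, or Unix terminal: the OS accepts a sequence of bytes."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding

    def write(self, text):
        self.raw.write(text.encode(self.encoding))


class FakeConsoleSink:
    """Stand-in for the Windows console, which accepts UTF-16 code units."""

    def __init__(self):
        self.units = []

    def write(self, text):
        data = text.encode("utf-16-le")
        # Split the little-endian byte string into 16-bit code units.
        self.units.extend(
            int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)
        )


file_like = io.BytesIO()
byte_sink = ByteStreamSink(file_like)
console_sink = FakeConsoleSink()

# The caller uses one interface; each sink converts the text differently.
for sink in (byte_sink, console_sink):
    sink.write("é✓")

print(file_like.getvalue())
print([hex(u) for u in console_sink.units])
```

In a real runtime the console sink would hand those code units to something like WriteConsoleW, and the choice between the two sinks would have to be made per stream at startup or on redirection.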

Another, more subjective point is that Microsoft just did not do enough to promote the use of Unicode. The first Windows OS with decent (for its time) Unicode support was Windows NT 3.1, released in 1993, long before Linux and OS X grew Unicode support. Still, the transition to Unicode in those OSes has been much more seamless and unproblematic. Microsoft once again listened to the sales people instead of the engineers, and kept the technically obsolete Windows 9x around until 2001. Instead of forcing developers to use a clean Unicode interface, they still ship the broken and now-unnecessary 8-bit API interface and invite programmers to use it (look at a few of the recent Windows API questions on Stack Overflow: most newbies still use the horrible legacy API!).

When Unicode came out, many people realized it was useful. Unicode started as a pure 16-bit encoding, so it was natural to use 16-bit code units. Microsoft then apparently said "OK, we have this 16-bit encoding, so we have to create a 16-bit API", not realizing that nobody would use it. The Unix luminaries, however, thought "how can we integrate this into the current system in an efficient and backward-compatible way so that people will actually use it?" and subsequently invented UTF-8, which is a brilliant piece of engineering. Just as when Unix was created, the Unix people thought a bit more, needed a bit longer, had less financial success, but eventually got it right.
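The backward-compatibility claim can be checked in a few lines of Python (the sample string is an arbitrary assumption). ASCII text is byte-for-byte identical in UTF-8, whereas its UTF-16 form contains zero bytes, which is what breaks char*-based C APIs that treat NUL as a string terminator.

```python
ascii_text = "hello"  # arbitrary ASCII-only sample

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
as_utf16 = ascii_text.encode("utf-16-le")

# UTF-8 leaves ASCII bytes untouched, so existing byte-oriented code keeps working.
print(as_utf8 == as_ascii)

# UTF-16 interleaves zero bytes, which NUL-terminated char* APIs cannot carry.
print(b"\x00" in as_utf8, b"\x00" in as_utf16)
```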

I cannot comment on Perl (though I think there are more Windows haters in the Perl community than in the Python community), but regarding Python I know that the BDFL (who doesn't like Windows either) has stated that adequate Unicode support on all platforms is a major goal.
