How to Output Unicode Strings on the Windows Console

Question

There are already a few questions relating to this problem. I think mine is a bit different because I don't have an actual problem; I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 sometimes contradicts the Unicode standard (e.g. collation) or is closer to the old UCS-2 than to UTF-16, but I'll keep the "UTF-16" terminology here for simplicity.

Background: In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense. For compatibility with medieval versions of Windows, there is a thing called "codepages" that is obsolete but nonetheless supported. AFAIK, there is only one correct and non-obsolete function for writing strings to the console, namely WriteConsoleW, which takes a UTF-16 string. A similar discussion applies to input streams, which I'll ignore here.

However, I think this represents a design flaw in the Windows API: there is a generic function called WriteFile that can write to all stream objects (files, pipes, consoles, ...), but it is byte-oriented and doesn't accept UTF-16 strings. The documentation suggests using WriteConsoleW, which is text-oriented, for console output, and WriteFile, which is byte-oriented, for everything else. Since both console streams and file objects are represented by kernel object handles, and console streams can be redirected, every write to a standard output stream has to go through a function that checks whether the handle represents a console or a file, which breaks polymorphism. OTOH, I do think that Windows's separation between text strings and raw bytes (which is mirrored in many other systems like Java or Python) is conceptually superior to Unix's char* approach, which ignores encodings and doesn't distinguish between strings and byte arrays.
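The per-write check described above could look roughly like the following sketch. It assumes that a successful GetConsoleMode call means the handle is a console, and that redirected output should be UTF-8; both are choices made for this example, not something the question prescribes. The names `utf16_to_utf8` and `write_text` are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encode a UTF-16 string as UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

#ifdef _WIN32
// Hypothetical helper: write text to a handle, choosing the API per handle.
// Console handle    -> WriteConsoleW with the UTF-16 text.
// Anything else     -> WriteFile with UTF-8 bytes (encoding is an assumption).
bool write_text(HANDLE h, const std::u16string& text) {
    DWORD mode = 0;
    DWORD written = 0;
    if (GetConsoleMode(h, &mode)) {
        return WriteConsoleW(h, text.data(), static_cast<DWORD>(text.size()),
                             &written, nullptr) != 0;
    }
    const std::string bytes = utf16_to_utf8(text);
    return WriteFile(h, bytes.data(), static_cast<DWORD>(bytes.size()),
                     &written, nullptr) != 0;
}
#endif
```

On Windows, a call such as `write_text(GetStdHandle(STD_OUTPUT_HANDLE), u"h\u00E9llo\n")` would then behave the same whether or not the output is redirected to a file.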

So my questions are: What should one do in this situation? And why isn't this problem solved even in Microsoft's own libraries? Both the .NET Framework and the C and C++ libraries seem to adhere to the obsolete codepage model. How would you design the Windows API or an application framework to circumvent this issue?

I think the general problem (which is not easy to solve) is that all libraries assume that all streams are byte-oriented and implement text-oriented streams on top of that. However, Windows does have special text-oriented streams at the OS level, and the libraries are unable to deal with them. So in any case we would have to introduce significant changes to all standard libraries. A quick and dirty way would be to treat the console as a special byte-oriented stream that accepts only one encoding. This still requires circumventing the C and C++ standard libraries, because they don't implement the WriteFile/WriteConsoleW switch. Is that correct?

Recommended answer

The general strategy I/we use in most (cross-platform) applications/projects is: we just use UTF-8 (I mean the real standard) everywhere. We use std::string as the container and simply interpret everything as UTF-8. We also handle all file I/O this way, i.e. we expect UTF-8 and save UTF-8. When we get a string from somewhere and know that it is not UTF-8, we convert it to UTF-8.

The most common case where we stumble upon WinUTF16 is filenames. So whenever we handle a filename, we convert the UTF-8 string to WinUTF16, and we convert in the other direction when we search a directory for files.
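As an illustration of that filename conversion, here is a minimal sketch. On Windows the usual API for this is MultiByteToWideChar with CP_UTF8; the hand-written decoder below is a stand-in so that the example is self-contained, and it does only minimal validation (it does not reject overlong encodings, for instance). The names `utf8_to_utf16` and `open_for_read_utf8` are hypothetical and not taken from the answer's code.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode UTF-8 into UTF-16.
// Malformed input bytes are replaced with U+FFFD, one byte at a time.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // default for an invalid lead byte
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // Consume continuation bytes (10xxxxxx), six payload bits each.
        bool ok = i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
            len = 1;  // skip only the offending byte
        }

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}

#ifdef _WIN32
// Hypothetical helper: open an existing file whose path is given in UTF-8.
// Returns INVALID_HANDLE_VALUE on failure, as CreateFileW does.
HANDLE open_for_read_utf8(const std::string& utf8_path) {
    const std::u16string u16 = utf8_to_utf16(utf8_path);
    const std::wstring wide(u16.begin(), u16.end());  // wchar_t is 16-bit on Windows
    return CreateFileW(wide.c_str(), GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
#endif
```

The reverse direction (directory listings returned as UTF-16 and converted back to UTF-8) would follow the same pattern with the roles of the two encodings swapped.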

The console isn't really used in our Windows build (there, all console output is redirected into a file). Since we have UTF-8 everywhere, our console output is UTF-8 too, which is fine for most modern systems. The Windows console log file therefore also contains UTF-8, and most text editors on Windows can read that without problems.

If we used the WinConsole more and cared a lot that all special characters are displayed correctly, we would probably write an automatic pipe handler that we install between fileno=1 (stdout) and the real console, and that uses WriteConsoleW as you have suggested (if there is really no easier way).

If you wonder how to implement such an automatic pipe handler: we have already implemented something like this for all POSIX-like systems. The code probably doesn't work on Windows as it is, but I think it should be possible to port it. Our current pipe handler is similar to what tee does, i.e. if you do cout << "Hello" << endl, the text is printed both on stdout and in a log file. Look at the code if you are interested in how this is done.
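The answer's pipe-based code is not reproduced here. As a simpler, portable analogue of the same tee idea, the sketch below duplicates output at the C++ stream level using a custom std::streambuf. The class name `TeeBuf` is hypothetical, and unlike a pipe handler this approach only sees output that goes through the C++ stream it is attached to, not raw writes to the file descriptor.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>

// Hypothetical tee-style stream buffer: forwards every character to two
// underlying buffers (for example the console buffer and a log file buffer).
class TeeBuf : public std::streambuf {
public:
    TeeBuf(std::streambuf* first, std::streambuf* second)
        : first_(first), second_(second) {
        assert(first_ != nullptr && second_ != nullptr);
    }

protected:
    // No put area is set up, so every character written arrives here.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof())) {
            return traits_type::not_eof(ch);
        }
        const char c = traits_type::to_char_type(ch);
        const bool ok_first =
            !traits_type::eq_int_type(first_->sputc(c), traits_type::eof());
        const bool ok_second =
            !traits_type::eq_int_type(second_->sputc(c), traits_type::eof());
        return (ok_first && ok_second) ? ch : traits_type::eof();
    }

    // Flush both targets; report failure if either one fails.
    int sync() override {
        const int r1 = first_->pubsync();
        const int r2 = second_->pubsync();
        return (r1 == 0 && r2 == 0) ? 0 : -1;
    }

private:
    std::streambuf* first_;
    std::streambuf* second_;
};
```

To apply this to cout, one would save the result of `std::cout.rdbuf()`, construct a `TeeBuf` from it and a log file's buffer, install the tee with `std::cout.rdbuf(&tee)`, and restore the saved buffer before the tee is destroyed.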
