Properly print UTF-8 characters in the Windows console


Question

This is how I am trying to do it:

#include <stdio.h>
#include <windows.h>
using namespace std;

int main() {
  SetConsoleOutputCP(CP_UTF8);
   //german chars won't appear
  char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
  int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
  wchar_t *unicode_text = new wchar_t[len];
  MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
  wprintf(L"%s", unicode_text);
}

The result is that only the US-ASCII characters are displayed. No errors are shown. The source file is encoded in UTF-8.

So, what am I doing wrong here?

To WouterH:

int main() {
  SetConsoleOutputCP(CP_UTF8);
  const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
  wprintf(L"%s", unicode_text);
}

  • This doesn't work either. The effect is the same. My font is, of course, Lucida Console.
  • Third attempt:

    #include <stdio.h>
    #define _WIN32_WINNT 0x05010300
    #include <windows.h>
    #define _O_U16TEXT  0x20000
    #include <fcntl.h>
    
    using namespace std;
    
    int main() {
        _setmode(_fileno(stdout), _O_U16TEXT);
        const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
        wprintf(L"%s", u_text);
    }
    

    OK, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz.

    Recommended answer

    By default, the wide print functions on Windows do not handle characters outside the ASCII range.

    There are a few ways to get Unicode data to the Windows console:

    • Use the console API directly, WriteConsoleW. You'll have to ensure you're actually writing to a console and use other means when the output is going somewhere else.

    • Set the mode of the standard output file descriptor to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console, the output stream of bytes is UTF-16 or UTF-8 respectively. N.B. after setting these modes, the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.

    • UTF-8 text can be printed directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions, such as basic_ostream<char>::operator<<(char*), don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
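The first approach in the list above could look roughly like the following. This is a minimal sketch, not code from the original answer: the helper name `print_utf8` is hypothetical, and it assumes the input is valid UTF-8. It uses `GetConsoleMode` to tell a real console from a redirected stream, converts to UTF-16 with `MultiByteToWideChar`, and calls `WriteConsoleW`; otherwise it passes the bytes through unchanged.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (an illustration only): print UTF-8 text to stdout.
// On a Windows console it goes through WriteConsoleW; when stdout is a file
// or pipe, or on other platforms, the raw UTF-8 bytes are written instead.
bool print_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
    std::fflush(stdout);  // keep ordering with earlier buffered CRT output
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {  // succeeds only for a console handle
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      static_cast<int>(utf8.size()), nullptr, 0);
        if (len <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            static_cast<int>(utf8.size()), &wide[0], len);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wide.size()),
                             &written, nullptr) != 0;
    }
#endif
    // Not a console: pass the UTF-8 bytes through untouched.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

A call such as `print_utf8("a\303\244" "b\n")` passes the UTF-8 bytes of "aäb" regardless of how the compiler interprets the source file encoding.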

    The problem with the third method is this:

    putc('\302', stdout); putc('\260', stdout); // doesn't work with CP_UTF8
    
    puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8
    

    Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that one call, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
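To illustrate why the split write fails, here is a small portable sketch that mimics decoding each write in isolation. The function `decodes_alone` is a hypothetical stand-in for that per-call conversion, not the console's actual code, and it is simplified: it checks only lead and continuation byte structure, not overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: is `bytes` well-formed UTF-8 when decoded on its own,
// with no knowledge of what earlier or later writes contained?
bool decodes_alone(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        if (lead < 0x80)                len = 1;  // ASCII
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > bytes.size()) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // expected a continuation byte
        }
        i += len;
    }
    return true;
}
```

Under this check, `"\302"` and `"\260"` are each rejected on their own, while `"\302\260"` (the degree sign, U+00B0) is accepted, which matches the putc versus puts behavior described above.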

    It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem, whereas whatever team works on the console probably doesn't care.

    You might solve it by implementing your own streambuf subclass that handles the conversion to wchar_t correctly, i.e., one that accounts for the fact that the bytes of a multibyte character may arrive separately and maintains conversion state between writes (e.g., std::mbstate_t).
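One possible shape for such a streambuf is sketched below. Note that it swaps in a simpler technique than the one suggested above: instead of carrying a `std::mbstate_t` and converting to wchar_t, it holds back an unfinished trailing UTF-8 sequence and forwards only complete sequences to a sink. The names `complete_prefix`, `Utf8ChunkBuf`, and `Sink` are hypothetical. On Windows the sink would presumably convert each chunk to UTF-16 and call WriteConsoleW; that part is left out so the sketch stays portable.

```cpp
#include <cstddef>
#include <functional>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical helper: number of leading bytes of `s` that contain no
// unfinished UTF-8 sequence at the very end.
std::size_t complete_prefix(const std::string& s) {
    std::size_t n = s.size();
    // Look back over at most 3 bytes for the lead byte of the last sequence.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, keep looking
        std::size_t need = (c & 0xE0) == 0xC0 ? 2
                         : (c & 0xF0) == 0xE0 ? 3
                         : (c & 0xF8) == 0xF0 ? 4 : 1;
        return need > back ? n - back : n;  // hold back an unfinished tail
    }
    return n;
}

// Hypothetical streambuf: collects bytes and hands the sink only chunks that
// end on a character boundary, so no multibyte character is split across calls.
class Utf8ChunkBuf : public std::streambuf {
public:
    using Sink = std::function<void(const std::string&)>;
    explicit Utf8ChunkBuf(Sink sink) : sink_(std::move(sink)) {}

protected:
    // Called for every single character, because no put area is set up.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            pending_.push_back(traits_type::to_char_type(ch));
            forward();
        }
        return traits_type::not_eof(ch);
    }
    // Called for multi-character writes.
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        forward();
        return n;
    }

private:
    void forward() {
        std::size_t n = complete_prefix(pending_);
        if (n == 0) return;            // nothing complete yet
        sink_(pending_.substr(0, n));  // complete characters only
        pending_.erase(0, n);          // keep the unfinished tail for later
    }
    Sink sink_;
    std::string pending_;
};
```

Attached to a `std::ostream`, writing the byte `'\302'` and then `'\260'` separately results in a single sink call carrying both bytes. The sketch forwards after every write, which is simple but chatty; a real implementation would likely buffer more and flush in `sync()`.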
