Why does wprintf separate Unicode ligature into two different graphemes?

Question

Code:

#include <stdio.h>
#include <wchar.h>
#define USE_W
int main()
{
#ifdef USE_W
    const wchar_t *ae_utf16 = L"\x00E6 & ASCII text ae\n";
    wprintf(ae_utf16);
#else
    const char *ae_utf8 = "\xC3\xA6 & ASCII text ae\n";
    printf(ae_utf8);
#endif
    return 0;
}

Output:

ae & ASCII text ae

While printf produces correct UTF-8 output:

æ & ASCII text ae

You can test this here.

Recommended answer

printf just sends raw bytes to your terminal; it does not know anything about encodings. If your terminal happens to be configured to interpret that as UTF-8, it will show the right characters.

wprintf, on the other hand, does know about encodings. It behaves as though it uses the function wcrtomb, which encodes a wide character (wchar_t) into a multibyte sequence, depending on the current locale. If the default locale happens to be "C", which is quite minimalistic, the character æ gets converted to the "more or less equivalent" byte sequence ae.

If you set the locale explicitly to something using UTF-8, like "en_US.UTF-8", the output is as expected. Of course, the set of supported locales differs per system, so it's no good to hardcode this.
