Why does wprintf separate Unicode ligature into two different graphemes?
Question
Code:
#include <stdio.h>
#include <wchar.h>

#define USE_W

int main()
{
#ifdef USE_W
    const wchar_t *ae_utf16 = L"\x00E6 & ASCII text ae\n";
    wprintf(ae_utf16);
#else
    const char *ae_utf8 = "\xC3\xA6 & ASCII text ae\n";
    printf(ae_utf8);
#endif
    return 0;
}
Output:
ae & ASCII text ae
While printf produces correct UTF-8 output:
æ & ASCII text ae
You can test this here.
Recommended answer
printf just sends raw bytes to your terminal; it does not know anything about encodings. If your terminal happens to be configured to interpret those bytes as UTF-8, it will show the right characters.
wprintf, on the other hand, does know about encodings. It behaves as though it uses the function wcrtomb, which encodes a wide character (wchar_t) into a multibyte sequence, depending on the current locale. If the default locale happens to be "C", which is quite minimalistic, the character æ gets converted to the "more or less equivalent" byte sequence ae.
If you set the locale explicitly to something using UTF-8, like "en_US.UTF-8", the output is as expected. Of course, the set of supported locales differs per system, so it's no good to hardcode this.