Printing UTF-8 strings with printf - wide vs. multibyte string literals

Question

In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them?

printf("ο Δικαιοπολις εν αγρω εστιν\n");
printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n");

And consequently is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal?

There are no issues with these strings printing. But I'm not using the wide string functions, because I want to be able to use printf etc. as well. So the question is are these ways of printing any different (given the situation outlined above), and if so, does the second one have any advantage?

Following the comments below, I now know that this program works -- which I thought wasn't possible:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");
    wprintf(L"ο Δικαιοπολις εν αγρω εστιν\n");  // wide output
    freopen(NULL, "w", stdout);                 // lets me switch
    printf("ο Δικαιοπολις εν αγρω εστιν\n");    // byte output
}
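The reason the freopen call matters is that a stream acquires an orientation (byte or wide) from the first I/O operation performed on it. As a rough sketch, the orientation can be queried with the standard fwide function; stream_orientation is a hypothetical helper name, not something from the program above.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: report the orientation of a stream.
 * Returns > 0 for wide-oriented, < 0 for byte-oriented,
 * and 0 if no orientation has been set yet.
 * Calling fwide with mode 0 only queries; it does not change anything. */
int stream_orientation(FILE *fp)
{
    assert(fp != NULL);
    return fwide(fp, 0);
}
```

On a freshly opened stream this should return 0; after a call such as fputs it should be negative, and after fputws positive. Whether freopen(NULL, ...) succeeds in resetting the orientation is implementation-defined, so it is worth checking its return value.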

EDIT3: I've done some further research by looking at what's going on with the two types. Take a simpler string:

wchar_t *wides = L"£100 π";
char *mbs = "£100 π";

The compiler is generating different code. The wide string is:

.string "\243"
.string ""
.string ""
.string "1"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string " "
.string ""
.string ""
.string "\300\003"
.string ""
.string ""
.string ""
.string ""
.string ""

While the second is:

.string "\302\243100 \317\200"

And looking at the Unicode encodings, the second is plain UTF-8. The wide character representation is UTF-32. I realise this is going to be implementation-dependent.
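The same comparison can be made at run time instead of reading the assembly. The following is only a sketch: hex_bytes is a hypothetical helper introduced here for illustration, and it simply formats a block of memory as space-separated hex bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `len` bytes of `data` into `out` as
 * lowercase hex pairs separated by single spaces.
 * Returns the number of characters written (excluding the terminator),
 * or 0 if the output buffer is too small. */
size_t hex_bytes(const void *data, size_t len, char *out, size_t out_size)
{
    assert(data != NULL && out != NULL);
    const unsigned char *p = data;
    size_t used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < len; ++i) {
        /* Each byte needs at most 3 characters plus the terminator. */
        if (out_size - used < 4)
            return 0;
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02x" : "%02x", (unsigned)p[i]);
    }
    return used;
}
```

Passing mbs with strlen(mbs) should show the UTF-8 bytes c2 a3 31 30 30 20 cf 80. Passing wides with wcslen(wides) * sizeof(wchar_t) shows whatever the implementation uses for wchar_t, which is what makes the wide form implementation-dependent.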

So perhaps the wide character representation of literals is more portable? My system will not print UTF-16/UTF-32 encodings directly, so it is being automatically converted to UTF-8 for output.

Recommended answer

printf("ο Δικαιοπολις εν αγρω εστιν\n");

prints the string literal (const char*, special characters are represented as multibyte characters). Although you might see the correct output, there are other problems you might be dealing with while working with non-ASCII characters like these. For example:

char str[] = "αγρω";
printf("%zu %zu\n", sizeof(str), strlen(str));

outputs 9 8, since each of these special characters is represented by 2 chars.
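If the number of characters rather than bytes is needed while staying with multibyte strings, one option is to count only the bytes that start a code point. This is a minimal sketch that assumes the input is valid UTF-8; utf8_codepoints is a hypothetical name, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count Unicode code points in a UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that
 * does NOT match that pattern begins a new code point.
 * Assumes the input is valid, NUL-terminated UTF-8. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string above, strlen reports 8 bytes while this helper reports 4 code points. Note that code points are still not the same thing as user-perceived characters once combining marks are involved.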

With the L prefix, the literal consists of wide characters (wchar_t), and the %ls format specifier causes these wide characters to be converted to multibyte characters (UTF-8 when the locale uses UTF-8). Note that in this case the locale must be set appropriately; otherwise the conversion may fail or produce invalid output:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}
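The conversion that %ls performs can also be done by hand, which makes the dependence on the locale easier to see. The sketch below wraps the standard wcstombs function; wide_to_multibyte is a hypothetical helper name, and the result depends entirely on the locale that is active when it is called.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to the multibyte encoding
 * of the current locale (LC_CTYPE).
 * Returns the number of bytes written, excluding the terminator, or
 * (size_t)-1 if a character cannot be represented in the current
 * locale or the buffer is too small to hold the terminated result. */
size_t wide_to_multibyte(const wchar_t *ws, char *out, size_t out_size)
{
    assert(ws != NULL && out != NULL && out_size > 0);
    size_t n = wcstombs(out, ws, out_size);
    if (n == (size_t)-1)
        return n;               /* unconvertible wide character */
    if (n == out_size)
        return (size_t)-1;      /* no room left for the terminator */
    return n;
}
```

After setlocale(LC_ALL, "") in a UTF-8 environment, converting L"αγρω" should yield 8 bytes. Without the setlocale call the program runs in the "C" locale, where non-ASCII characters typically cannot be converted and the function reports failure.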

However, while some things get more complicated when working with wide characters, other things become much simpler and more straightforward. For example:

wchar_t str[] = L"αγρω";
printf("%zu %zu", sizeof(str) / sizeof(wchar_t), wcslen(str));

will output 5 4 as one would naturally expect.

Once you decide to work with wide strings, wprintf can be used to print wide characters directly. It is also worth noting that for the Windows console, the translation mode of stdout should be explicitly set to one of the Unicode modes by calling _setmode:

#include <stdio.h>
#include <wchar.h>

#include <io.h>
#include <fcntl.h>
#ifndef _O_U16TEXT
  #define _O_U16TEXT 0x20000
#endif

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"%ls\n", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}
