char vs wchar_t vs char16_t vs char32_t (C++11)
Question
From what I understand, a char is safe for holding ASCII characters, whereas char16_t and char32_t are safe for holding Unicode characters, one for the 16-bit variety and the other for the 32-bit variety (should I have said "a" instead of "the"?). That leaves me wondering what the purpose of wchar_t is. Should I ever use that type in new code, or is it there only to support old code? What was the purpose of wchar_t in old code if, as I understand it, its size was never guaranteed to be bigger than a char? Clarification would be nice!
Recommended answer
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for Unicode: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
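To make the code-unit point concrete, here is a small sketch that counts how many code units one code point takes in each encoding. The helper name code_units is made up for this illustration, and U+1F600 is just an arbitrary code point outside the Basic Multilingual Plane.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same single code point, U+1F600, in the three encodings.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four 8-bit code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: two 16-bit code units (a surrogate pair)");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one 32-bit code unit");
```

Only UTF-32 gives one code unit per code point here; in the other two encodings a single element of the string is not necessarily a whole character.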
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and that whatever representation was used for char (multiple bytes, shift codes, what have you), each character would become a single, distinct wchar_t value. The purpose was to let you manipulate wchar_t strings with the same simple, one-element-at-a-time algorithms used with ASCII.
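A minimal sketch of that char-to-wchar_t conversion, using the standard std::mbstowcs. The helper name widen is hypothetical, and the result depends on the current C locale (as set with std::setlocale), which is assumed here to match the encoding of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte string in the current C locale's
// encoding to a wide string.
std::wstring widen(const std::string& narrow) {
    // Every wide character consumes at least one byte of input, so the
    // result can never have more elements than the input has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("input is not valid in the current locale");
    }
    assert(n <= narrow.size());
    return std::wstring(buf.data(), n);
}
```

In the default "C" locale, widen("hello") simply yields L"hello". With a multibyte locale active, a character stored as several char elements is expected to come out as one wchar_t, at least on platforms where wchar_t is wide enough to hold it.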
For example, converting ASCII to upper case goes like this:
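#include <locale>  // std::locale and the two-argument std::toupper

auto loc = std::locale("");  // the user's preferred locale
char s[] = "hello";
for (char &c : s) {
    c = std::toupper(c, loc);  // maps one char to one char
}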
But this won't handle converting all characters in UTF-8 to upper case, or all characters in some other encoding such as Shift-JIS, because a single character there can span several char elements. People wanted to be able to internationalize this code like so:
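#include <locale>  // std::locale and the two-argument std::toupper

auto loc = std::locale("");  // the user's preferred locale
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
    c = std::toupper(c, loc);  // maps one wchar_t to one wchar_t
}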
So every wchar_t is a 'character', and if it has an uppercase version it can be converted directly. Unfortunately, this doesn't work all the time. Some languages have oddities such as the German letter ß, whose uppercase version is the two characters SS rather than a single character.
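The ß problem follows directly from the shape of the interface, which the sketch below tries to show. The helper name upper_per_char is hypothetical; the point is only that std::toupper takes one element and returns one element, so the result can never be longer than the input.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: uppercase a wide string one element at a time.
std::wstring upper_per_char(const std::wstring& in, const std::locale& loc) {
    std::wstring out = in;
    for (wchar_t& c : out) {
        c = std::toupper(c, loc);  // one wchar_t in, one wchar_t out
    }
    // A per-element mapping cannot grow the string, so "ß" can never
    // become the two characters "SS" here.
    assert(out.size() == in.size());
    return out;
}
```

Calling upper_per_char(L"stra\u00DFe", std::locale::classic()) gives back six elements for six elements of input. Producing "STRASSE" needs a string-level case mapping, which this per-character interface cannot express.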
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to convert at the API boundaries to whatever encoding is required.
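As a rough sketch of converting at the boundary, here is a hand-written UTF-8 to UTF-16 conversion. The function name utf8_to_utf16 is hypothetical, and the validation is deliberately partial: it rejects bad lead bytes, bad continuation bytes, truncated sequences and out-of-range values, but not overlong forms or encoded surrogates. In real code the platform's own conversion routine (MultiByteToWideChar on Windows, for instance) is likely the better choice.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;
        assert(i <= in.size());

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The idea is that the application keeps std::string holding UTF-8 everywhere and calls something like this only immediately before handing text to an API that expects 16-bit code units.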