std::wstring vs std::string


Question

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What exactly is a "wide character"?

Recommended answer

string? wstring?

std::string is a basic_string templated on char, and std::wstring is a basic_string templated on wchar_t.

char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to hold a wide character, and that is where things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows it is 2 bytes.

The problem is that neither char nor wchar_t is directly tied to Unicode.

Let's take a Linux OS: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

#include <cstddef>
#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char text[] = "olé";

    std::cout << "sizeof(char)    : " << sizeof(char) << "\n";
    std::cout << "text            : " << text << "\n";
    std::cout << "sizeof(text)    : " << sizeof(text) << "\n";
    std::cout << "strlen(text)    : " << std::strlen(text) << "\n";

    std::cout << "text(ordinals)  :";

    // Print each byte of the narrow string as an unsigned number.
    for (std::size_t i = 0, iMax = std::strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }

    std::cout << "\n\n";

    // - - -

    const wchar_t wtext[] = L"olé";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext           : " << wtext << "\n"; <- error
    std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext           : " << wtext << "\n";

    std::cout << "sizeof(wtext)   : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext)   : " << std::wcslen(wtext) << "\n";

    std::cout << "wtext(ordinals) :";

    // Print each wide character as a number; unsigned long is wide
    // enough for a 4-byte wchar_t.
    for (std::size_t i = 0, iMax = std::wcslen(wtext); i < iMax; ++i)
    {
        std::cout << " " << static_cast<unsigned long>(wtext[i]);
    }

    std::cout << "\n\n";
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol�
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see that the "olé" text in char is really made of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with char on Linux, you usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating or otherwise manipulating Unicode chars, because some combinations of chars are forbidden in UTF-8.

On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world before the advent of Unicode.

So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which for a long time could not be UTF-8. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, which is Unicode encoded in 2-byte code units (or at the very least UCS-2, which just lacks surrogate pairs and thus characters outside the BMP (>= 64K)).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).

UTF-32 is 4 bytes per character, so there is not much to add, except that UTF-8 and UTF-16 text will always use less memory than, or the same amount as, UTF-32 text (and usually less).

If memory is a concern, you should know that for most Western languages, UTF-8 text will use less memory than the same text in UTF-16.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.

See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

When should I use std::wstring over std::string?

On Linux? Almost never (§). On Windows? Almost always (§). In cross-platform code? Depends on your toolkit...

(§): unless you use a toolkit/framework saying otherwise

Can std::string hold the entire ASCII character set, including special characters?

Note: a std::string is suitable for holding a "binary" buffer, whereas a std::wstring is not!

On Linux? Yes. On Windows? Only special characters available for the current locale of the Windows user.

Edit (after a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

  1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
  2. A char from 0 to 127 will be held correctly.
  3. A char from 128 to 255 will have a meaning that depends on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

Is std::wstring supported by almost all popular C++ compilers?

Mostly, with the exception of GCC-based compilers that are ported to Windows. It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

What exactly is a wide character?

In C/C++, it is a character type written wchar_t, which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
