std::wstring VS std::string


Question

    I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

    1. When should I use std::wstring over std::string?
    2. Can std::string hold the entire ASCII character set, including the special characters?
    3. Is std::wstring supported by all popular C++ compilers?
    4. What is exactly a "wide character"?

    Solution

    string? wstring?

    std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

    char vs. wchar_t

    char is supposed to hold a character, usually a 1-byte character. wchar_t is supposed to hold a wide character, and then things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

    What about Unicode, then?

    The problem is that neither char nor wchar_t is directly tied to Unicode.

    On Linux?

    Let's take a Linux OS: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

    #include <cstring>
    #include <iostream>
    
    int main(int argc, char* argv[])
    {
       const char text[] = "olé" ;
    
    
       std::cout << "sizeof(char)    : " << sizeof(char) << std::endl ;
       std::cout << "text            : " << text << std::endl ;
       std::cout << "sizeof(text)    : " << sizeof(text) << std::endl ;
       std::cout << "strlen(text)    : " << strlen(text) << std::endl ;
    
       std::cout << "text(bytes)     :" ;
    
       for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
       {
          std::cout << " " << static_cast<unsigned int>(
                                  static_cast<unsigned char>(text[i])
                              );
       }
    
       std::cout << std::endl << std::endl ;
    
       // - - - 
    
       const wchar_t wtext[] = L"olé" ;
    
       std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
       //std::cout << "wtext           : " << wtext << std::endl ; <- error
       std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl ;
       std::wcout << L"wtext           : " << wtext << std::endl;
    
       std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl ;
       std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl ;
    
       std::cout << "wtext(bytes)    :" ;
    
       for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
       {
          std::cout << " " << static_cast<unsigned int>(
                                  static_cast<unsigned short>(wtext[i])
                                  );
       }
    
       std::cout << std::endl << std::endl ;
    
       return 0;
    }
    

    outputs the following text:

    sizeof(char)    : 1
    text            : olé
    sizeof(text)    : 5
    strlen(text)    : 4
    text(bytes)     : 111 108 195 169
    
    sizeof(wchar_t) : 4
    wtext           : UNABLE TO CONVERT NATIVELY.
    wtext           : ol�
    sizeof(wtext)   : 16
    wcslen(wtext)   : 3
    wtext(bytes)    : 111 108 233
    

    You'll see the "olé" text in char is really made up of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

    So, when working with char on Linux, you usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

    Note that std::string, like the C string API, will consider the "olé" string to have four characters, not three. So you should be cautious when truncating or playing with Unicode chars, because some combinations of bytes are forbidden in UTF-8.

    On Windows?

    On Windows, this is a bit different. Win32 had to support a lot of applications working with char on different charsets/codepages, produced all over the world, before the advent of Unicode.

    So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine. For example, "olé" would be "olé" on a French-localized Windows, but something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

    For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and encoded in UTF-16, which encodes Unicode in 2-byte units (or, at the very least, the mostly compatible UCS-2, which is almost the same thing, IIRC).

    Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

    Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText (a low-level API function to set the label on a Win32 GUI).

    Memory issues?

    UTF-32 is 4 bytes per character, so there is not much to add, other than that a UTF-8 or UTF-16 text will always use less memory than, or the same amount as, a UTF-32 text (and usually less).

    If there is a memory issue, then you should know that for most western languages, a UTF-8 text will use less memory than the same UTF-16 one.

    Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or larger for UTF-8 than for UTF-16.

    All in all, UTF-16 will mostly use 2 bytes per character (unless you're dealing with some kind of esoteric language glyphs: Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.

    See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

    Conclusion

    1. When should I use std::wstring over std::string?

    On Linux? Almost never (§).
    On Windows? Almost always (§).
    On cross-platform code? Depends on your toolkit...

    (§) : unless you use a toolkit/framework saying otherwise

    2. Can std::string hold the entire ASCII character set, including special characters?

    Notice: a std::string is suitable for holding a 'binary' buffer, whereas a std::wstring is not!

    On Linux? Yes.
    On Windows? Only the special characters available in the current locale of the Windows user.

    Edit (after a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. a char from 0 to 127 will be held correctly
    3. a char from 128 to 255 will have a signification depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

    3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, except for GCC-based compilers that are ported to Windows.
    It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

    4. What is exactly a wide character?

    In C/C++, it's a character type, written wchar_t, which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...)
