std::u16string, std::u32string, std::string, length(), size(), codepoints and characters


Question

I'm happy to see std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to handle it very well. What I mean is, doesn't std::string::length() still return the size of the string's buffer rather than the number of characters in the string?

So, how is the length() method of the standard strings defined for the new C++11 classes? Do they return the size of the string's buffer, the number of codepoints, or the number of characters? (I'm assuming a surrogate pair is 2 code points but one character. Please correct me if I'm wrong.)

And what about size()? Isn't it equal to length()? See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.

So, I guess, my fundamental question is: how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?

Recommended answer

u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for the char16_t and char32_t types.
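The typedef relationship can be checked at compile time. This is only a small sketch using std::is_same from <type_traits>; it compiles to nothing and simply fails to build if the claim were false.

```cpp
#include <string>
#include <type_traits>

// Each alias names a std::basic_string specialization, not a distinct class.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "std::u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "std::u32string is basic_string<char32_t>");
```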

length() is always equal to size() for any basic_string. It is the number of T's in the string, where T is the character type the basic_string is instantiated with.
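To make "the number of T's" concrete, here is a minimal sketch that stores the single code point U+1F600 in each string type. The function name show_lengths is made up for illustration, and the UTF-8 bytes are written as explicit escapes so the result does not depend on how the compiler encodes narrow literals.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: one code point (U+1F600), three element counts.
void show_lengths() {
    const std::string    utf8  = "\xF0\x9F\x98\x80"; // four char elements (UTF-8 code units)
    const std::u16string utf16 = u"\U0001F600";       // two char16_t elements (a surrogate pair)
    const std::u32string utf32 = U"\U0001F600";       // one char32_t element

    // length() counts elements of T, not code points or characters.
    assert(utf8.length()  == 4);
    assert(utf16.length() == 2);
    assert(utf32.length() == 1);

    // size() and length() are two names for the same value.
    assert(utf8.size()  == utf8.length());
    assert(utf16.size() == utf16.length());
    assert(utf32.size() == utf32.length());
}
```

Note that a surrogate pair is two UTF-16 code units that together encode one code point, which is why the u16string reports 2 here.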

basic_string is not Unicode-aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply an ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the element types of u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.

Iterators iterate over elements of type T, not "bytes, codepoints, or characters". If T is char16_t, then they iterate over char16_ts. If the string is UTF-16-encoded, then you are iterating over UTF-16 code units, not Unicode codepoints or bytes.
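Since iteration yields code units, counting code points is something you have to do yourself. The following is a rough sketch, not a standard facility: the function names are invented for illustration, and both functions assume the input is well-formed UTF-8 or UTF-16 (no validation is attempted). Counting user-perceived characters (grapheme clusters) is a further step that needs Unicode segmentation data, which the standard library does not provide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: counts code points in well-formed UTF-16 by skipping
// low (trailing) surrogates, which lie in the range 0xDC00..0xDFFF.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    }
    return n;
}

// Usage: "a", U+20AC and U+1F600 are three code points but four code units.
void demo() {
    const std::u16string s = u"a\u20AC\U0001F600";
    assert(s.size() == 4);
    assert(count_code_points_utf16(s) == 3);
}
```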
