std::u16string, std::u32string, std::string, length(), size(), codepoints and characters
Question
I'm happy to see std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to do it very well. What I mean is, doesn't std::string::length() still return the size of the string's buffer rather than the number of characters in the string?
So, how is the length() method of the standard strings defined for the new C++11 classes? Does it return the size of the string's buffer, the number of codepoints, or the number of characters (assuming a surrogate pair is 2 code points, but one character; please correct me if I'm wrong)?
And what about size(); isn't it equal to length()? See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.
So, I guess, my fundamental question is: how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?
Recommended answer
u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for the char16_t and char32_t types.
length is always equal to size for any basic_string. It is the number of T's in the string, where T is the character type that the basic_string template is instantiated with.
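As a rough illustration of "the number of T's", the sketch below stores the same three code points in each string type and reports how many elements each one holds. The struct and function names are made up for this example, and the UTF-8 bytes are written out by hand so the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this example: element counts per string type.
struct CodeUnitCounts {
    std::size_t utf8;   // number of char elements
    std::size_t utf16;  // number of char16_t elements
    std::size_t utf32;  // number of char32_t elements
};

// The text is "a", U+00E9 (e with acute accent) and U+1F600 (outside the BMP).
inline CodeUnitCounts count_code_units() {
    const std::string    s8  = "a\xC3\xA9\xF0\x9F\x98\x80";  // UTF-8 bytes spelled out
    const std::u16string s16 = u"a\u00E9\U0001F600";
    const std::u32string s32 = U"a\u00E9\U0001F600";

    // length() and size() always agree; both count elements of type T.
    assert(s8.length() == s8.size());
    assert(s16.length() == s16.size());
    assert(s32.length() == s32.size());

    return {s8.size(), s16.size(), s32.size()};
}
```

Calling count_code_units() should give 7, 4 and 3: seven char elements for UTF-8, four char16_t elements for UTF-16 (the last code point needs a surrogate pair), and three char32_t elements for UTF-32. None of the three numbers is a count of user-perceived characters in general.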
basic_string is not Unicode-aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply an ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the character types of u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.
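Because the string class knows nothing about the encoding, anything like a code point count has to be computed by the caller or by a Unicode library. A minimal sketch for UTF-8 follows; count_utf8_code_points is a hypothetical helper, not a standard function, and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Assumes s holds well-formed UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
inline std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes "a\xC3\xA9\xF0\x9F\x98\x80" this returns 3, while size() on the same string returns 7. It still counts code points, not graphemes; grouping code points into user-perceived characters needs the Unicode segmentation rules, which is a job for a dedicated library.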
Iterators iterate over elements of type T, not "bytes, codepoints, or characters". If T is char16_t, then it will iterate over char16_ts. If the string is UTF-16-encoded, then it is iterating over UTF-16 code units, not Unicode codepoints or bytes.
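To get from code units to code points with a u16string, the surrogate pairs have to be combined by hand (or by a library). The sketch below shows one way to do that; utf16_to_code_points is a hypothetical name, and the function does no validation beyond passing unpaired surrogates through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: walks the char16_t code units of a UTF-16 string
// and combines each high/low surrogate pair into a single code point.
// Unpaired surrogates are copied through as-is; no error is reported.
inline std::u32string utf16_to_code_points(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < in.size()) {
            const char32_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(unit);
    }
    return out;
}
```

With the input u"a\U0001F600", plain iteration visits three char16_t elements, whereas the returned u32string has two elements: U'a' and the code point 0x1F600.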