How do I properly use std::string on UTF-8 in C++?
Question
My platform is a Mac and C++11 (or above). I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.
I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.
However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.
Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for handling the above problems?
Recommended answer
Unicode Glossary
Unicode is a vast and complex topic. I do not wish to wade too deep there; however, a quick glossary is necessary:
- Code Points: Code Points are the basic building blocks of Unicode; a Code Point is just an integer mapped to a meaning. The integer portion fits into 32 bits (well, 21 bits really, as the highest Code Point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
- Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.
These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.
Then, a sequence of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.
In UTF-X, X is the size in bits of the Code Unit; each Code Point is represented as one or several Code Units, depending on its magnitude:
- UTF-8: 1 to 4 Code Units,
- UTF-16: 1 or 2 Code Units,
- UTF-32: 1 Code Unit.
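To make the Code Unit counts above concrete, here is a minimal sketch of a UTF-8 encoder. The names encode_utf8, utf16_length and encoding_demo are hypothetical helpers introduced for illustration, and the code assumes its input is a valid Code Point (no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encodes one Code Point as UTF-8 Code Units (bytes).
// Assumes cp is a valid Code Point; no error handling is attempted.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 Code Unit:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 Code Units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 Code Units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 Code Units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: number of UTF-16 Code Units needed for one Code Point.
std::size_t utf16_length(char32_t cp) { return cp < 0x10000 ? 1 : 2; }

inline void encoding_demo() {
    assert(encode_utf8(U'A').size() == 1);        // ASCII letter
    assert(encode_utf8(0x00E9).size() == 2);      // U+00E9, e with acute accent
    assert(encode_utf8(0x54C8) == "\xE5\x93\x88"); // U+54C8, a Chinese character
    assert(encode_utf8(0x1F600).size() == 4);     // U+1F600, a smiley
    assert(utf16_length(0x54C8) == 1);
    assert(utf16_length(0x1F600) == 2);           // needs a surrogate pair
}
```

Calling encoding_demo() checks one Code Point from each size class.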
- Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
- The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
- While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.
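As an illustration of converting at the boundary, the sketch below decodes UTF-8 bytes (as read from disk) into a std::u32string. The names decode_utf8 and decode_demo are hypothetical, and the decoder assumes well-formed input: it does not reject overlong, truncated or otherwise malformed sequences, which production code should do (or delegate to a library).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into Code Points.
// Assumes well-formed UTF-8; malformed input is NOT detected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // The lead byte tells how many Code Units the Code Point spans.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}

inline void decode_demo() {
    // 'A' followed by U+54C8 (3 bytes in UTF-8): 4 bytes, 2 Code Points.
    const std::string utf8 = "A\xE5\x93\x88";
    const std::u32string decoded = decode_utf8(utf8);
    assert(utf8.size() == 4);
    assert(decoded.size() == 2);
    assert(decoded == U"A\u54C8");
}
```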
If you are only reading or composing strings, you should have little to no issue with std::string or std::wstring.
Troubles start when you start slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
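Handling Code Point boundaries in UTF-8 relies on one fact: continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a Code Point. A minimal sketch follows; is_continuation, count_code_points, truncate_utf8 and boundary_demo are hypothetical names, and well-formed UTF-8 is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts Code Points (not Grapheme Clusters) by counting non-continuation bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

// Keeps at most max_bytes bytes, backing up so no Code Point is split.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // If the first dropped byte is a continuation byte, the cut is mid Code Point.
    while (cut > 0 && is_continuation(s[cut])) --cut;
    return s.substr(0, cut);
}

inline void boundary_demo() {
    // 'a', U+54C8 (3 bytes), 'b': 5 bytes but 3 Code Points.
    const std::string s = std::string("a\xE5\x93\x88") + "b";
    assert(s.size() == 5);
    assert(count_code_points(s) == 3);
    assert(truncate_utf8(s, 3) == "a");              // would split U+54C8, so back up
    assert(truncate_utf8(s, 4) == "a\xE5\x93\x88");  // lands on a boundary
}
```

This only protects Code Points; a cut can still fall inside a Grapheme Cluster (for example between a letter and a combining diacritic).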
If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint, though heavy use of Chinese may change the deal (most Chinese characters take 3 bytes in UTF-8, versus 2 in UTF-16 and 4 in UTF-32). As always, profile.
If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
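A short sketch of what "out of the box" means for std::u32string; u32string_demo is a hypothetical name. The last assertion shows the remaining caveat: a flag is still two Code Points.

```cpp
#include <cassert>
#include <string>

inline void u32string_demo() {
    // U+54C8 twice, then " ok": 5 Code Points, hence 5 Code Units.
    const std::u32string s = U"\u54C8\u54C8 ok";
    assert(s.size() == 5);                       // size() counts Code Points
    assert(s[1] == U'\u54C8');                   // indexing yields a whole Code Point
    assert(s.find(U'o') == 3);                   // positions are Code Point positions
    assert(s.substr(0, 2) == U"\u54C8\u54C8");   // slicing cannot split a Code Point

    // Regional indicators F (U+1F1EB) and R (U+1F1F7) render as one flag,
    // yet remain two Code Points: Grapheme Clusters are not handled.
    const std::u32string flag = U"\U0001F1EB\U0001F1F7";
    assert(flag.size() == 2);
}
```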
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
UTF-8 actually works quite well in std::string.
Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
- str.find('\n') works,
- str.find("...") works for matching byte by byte [1],
- str.find_first_of("\r\n") works if searching for ASCII characters.
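The three cases above can be checked with a small sketch; find_demo is a hypothetical name and the non-ASCII characters are written as explicit UTF-8 byte escapes so the example does not depend on the source file encoding. The last assertion shows why find_first_of is only safe with ASCII sets: it treats its argument as a set of bytes, not of Code Points.

```cpp
#include <cassert>
#include <string>

inline void find_demo() {
    const std::string ha = "\xE5\x93\x88";  // U+54C8
    const std::string a  = "\xE5\x95\x8A";  // U+554A, shares the lead byte 0xE5
    // Byte layout: 'o' 'k' ' ' [E5 95 8A] [E5 93 88] '\n'
    // Byte index:   0   1   2    3..5       6..8      9
    const std::string text = "ok " + a + ha + "\n";

    assert(text.find('\n') == 9);             // ASCII search works
    assert(text.find(ha) == 6);               // whole-sequence search works
    assert(text.find_first_of("\r\n") == 9);  // ASCII set works

    // Pitfall: the set is {E5, 93, 88}, so the lead byte of U+554A matches.
    assert(text.find_first_of(ha) == 3);
}
```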
Similarly, regex should mostly work out of the box. Since a sequence of characters, whether "haha" or "哈", is just a sequence of bytes, basic search patterns should work out of the box.
Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.
Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
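The repeater pitfall can be sketched with std::regex over char, which sees one byte per element. regex_demo is a hypothetical name, the character is spelled as UTF-8 byte escapes, and the behavior shown assumes the implementation matches bytes literally, as mainstream standard libraries are expected to do in the default locale.

```cpp
#include <cassert>
#include <regex>
#include <string>

inline void regex_demo() {
    const std::string ha = "\xE5\x93\x88";     // U+54C8 as 3 UTF-8 bytes
    const std::string text = "say " + ha + ha;
    const std::string truncated = "\xE5\x93";  // U+54C8 minus its last byte
    const std::string empty;

    // A plain literal pattern is just a byte sequence, so it matches.
    assert(std::regex_search(text, std::regex(ha + ha)));

    // Without parentheses, '?' applies to the last byte only.
    const std::regex unparenthesized(ha + "?");
    assert(std::regex_match(ha, unparenthesized));
    assert(std::regex_match(truncated, unparenthesized));  // accepts broken UTF-8
    assert(!std::regex_match(empty, unparenthesized));     // character is not optional

    // With parentheses, the whole byte sequence is optional.
    const std::regex parenthesized("(" + ha + ")?");
    assert(std::regex_match(ha, parenthesized));
    assert(std::regex_match(empty, parenthesized));
    assert(!std::regex_match(truncated, parenthesized));
}
```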
[1] The key concepts to look up are normalization and collation; these affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
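A small sketch of why normalization and collation matter for std::string comparisons; normalization_demo is a hypothetical name. The same visible text can be stored as different byte sequences, and byte order is not dictionary order.

```cpp
#include <cassert>
#include <string>

inline void normalization_demo() {
    // Two encodings of the same visible character, e with acute accent:
    const std::string precomposed = "\xC3\xA9";   // U+00E9, single Code Point
    const std::string decomposed  = "e\xCC\x81";  // 'e' followed by U+0301 (combining acute)

    // Byte-by-byte comparison sees two different strings.
    assert(precomposed != decomposed);
    assert(precomposed.size() == 2);
    assert(decomposed.size() == 3);

    // Sorting follows byte values, not language rules:
    assert(std::string("Z") < std::string("a"));  // 0x5A < 0x61
    assert(std::string("z") < precomposed);       // 0x7A < 0xC3, accented e sorts after 'z'
}
```

Making the first pair compare equal requires normalizing both strings to the same form first, which is what a library such as ICU provides.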