How do I properly use std::string on UTF-8 in C++?

Published: 2020/9/26 23:27:25. Tags: c++, string, c++11

Problem description

My platform is a Mac and C++11 (or above). I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

Recommended answer

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:

Code Points: Code Points are the basic building blocks of Unicode; a Code Point is just an integer mapped to a meaning. The integer fits into 32 bits (well, 21 bits really, since the largest Code Point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".

Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over, because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit; each Code Point is represented as one or several Code Units, depending on its magnitude:

UTF-8: 1 to 4 Code Units,

UTF-16: 1 or 2 Code Units,

UTF-32: 1 Code Unit.
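To make the Code Unit counts concrete, here is a small sketch that is not part of the original answer. It stores the same Code Point in the three encodings and reports how many Code Units each one needs. The struct `CodeUnitCounts` and both helper functions are hypothetical names introduced for illustration; the UTF-8 bytes are written as explicit escapes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Number of Code Units needed to store one Code Point in each encoding.
struct CodeUnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// U+54C8 (the Chinese character "ha"): 3 / 1 / 1 Code Units.
inline CodeUnitCounts counts_for_ha() {
    const std::string    utf8  = "\xE5\x93\x88"; // UTF-8 bytes of U+54C8
    const std::u16string utf16 = u"\u54C8";
    const std::u32string utf32 = U"\u54C8";
    return {utf8.size(), utf16.size(), utf32.size()};
}

// U+1F600 (a smiley outside the Basic Multilingual Plane): 4 / 2 / 1 Code Units.
inline CodeUnitCounts counts_for_smiley() {
    const std::string    utf8  = "\xF0\x9F\x98\x80"; // UTF-8 bytes of U+1F600
    const std::u16string utf16 = u"\U0001F600";      // stored as a surrogate pair
    const std::u32string utf32 = U"\U0001F600";
    return {utf8.size(), utf16.size(), utf32.size()};
}
```

Note that `size()` always counts Code Units, never Code Points, which is why `std::string::size()` reports 3 for a single Chinese character.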

Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).

The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).

While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no trouble with std::string or std::wstring.

Trouble starts when you begin slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
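As an illustration of handling Code Point boundaries by hand in UTF-8, the following sketch relies on one property of the encoding: every byte that continues a Code Point has the bit pattern 10xxxxxx, and no byte that starts one does. The function names are hypothetical, and the code assumes the input is already well-formed UTF-8; it does nothing about Grapheme Clusters.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which never start a Code Point.
inline bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts Code Points by counting the bytes that start one.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation_byte(c)) ++n;
    }
    return n;
}

// Keeps at most max_bytes bytes, backing up so no Code Point is cut in half.
inline std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the cut would land inside a Code Point, move it back to its first byte.
    while (end > 0 && is_continuation_byte(s[end])) --end;
    return s.substr(0, end);
}
```

For example, with the 5-byte string "a\xE5\x93\x88" "b" (an "a", U+54C8, then a "b"), `count_code_points` returns 3, and truncating to 2 or 3 bytes yields just "a" rather than a broken sequence.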

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
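If you go the std::u32string route, the conversion from UTF-8 happens at the boundary. The sketch below is one possible hand-written decoder, shown only to illustrate the idea; `decode_utf8` is a hypothetical name, and the code assumes well-formed input and performs no validation (a production decoder would need to reject overlong forms, stray continuation bytes and truncated sequences).

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into Code Points. Assumes well-formed input; no validation.
inline std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the Code Point occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, indexing and `size()` operate on Code Points: for the bytes "a\xE5\x93\x88" followed by "b", the result has size 3 and element 1 is U+54C8.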

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

str.find('\n') works,
str.find("...") works for matching byte by byte¹,
str.find_first_of("\r\n") works if searching for ASCII characters.
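A short sketch of the three searches listed above, run on a UTF-8 string that mixes Chinese text with an ASCII line break. The helper names and the sample text are assumptions made for illustration; the bytes are spelled out as escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Two Chinese characters, "\r\n", then U+54C8 twice, all encoded as UTF-8.
inline const std::string& sample_text() {
    static const std::string s =
        "\xE4\xBD\xA0\xE5\xA5\xBD"   // U+4F60 U+597D, 3 bytes each
        "\r\n"
        "\xE5\x93\x88\xE5\x93\x88";  // U+54C8 twice, 3 bytes each
    return s;
}

// Searching for ASCII characters is safe: ASCII bytes never occur inside
// a multi-byte Code Point.
inline std::size_t find_line_break(const std::string& s) {
    return s.find_first_of("\r\n");
}

// Searching for a whole Code Point is a plain byte-by-byte substring search.
inline std::size_t find_ha(const std::string& s) {
    return s.find("\xE5\x93\x88");
}
```

The returned positions are byte offsets, not Code Point indices: here the line break is found at offset 6 and the first U+54C8 at offset 8, even though they are the third and fifth "characters".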

Similarly, regex should mostly work out of the box. As a sequence of characters (whether "haha" or "哈") is just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]): depending on the regex flavor and implementation, they may or may not match Unicode characters.

Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional. Use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
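The pitfall can be reproduced with std::regex over plain char, which works on bytes. This is a sketch under that assumption; `matches` is a hypothetical helper, U+54C8 is written as its three UTF-8 bytes, and other regex engines or flavors may behave differently.

```cpp
#include <regex>
#include <string>

// Full-match helper around std::regex, which here sees the text as bytes.
inline bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// "ha", then U+54C8 with '?' applied directly: only the last byte (0x88)
// is optional, so the pattern still demands the first two bytes.
const char kUngrouped[] = "ha\xE5\x93\x88?!";

// "ha", then the whole three-byte sequence made optional by grouping it.
const char kGrouped[] = "ha(\xE5\x93\x88)?!";
```

With these patterns, `matches("ha!", kGrouped)` is true while `matches("ha!", kUngrouped)` is false, and the ungrouped pattern accepts the malformed text "ha\xE5\x93!" in which the character has lost its last byte.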

¹ The key concepts to look up are normalization and collation; these affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
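Both effects are easy to observe with std::string alone. In this sketch (helper names are hypothetical), the same visible letter "é" is built in two ways, precomposed as U+00E9 and decomposed as "e" followed by the combining accent U+0301, and a small word list is sorted with the default byte-wise ordering.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// "é" as a single precomposed Code Point, U+00E9 (UTF-8: C3 A9).
inline std::string precomposed_e_acute() { return "\xC3\xA9"; }

// "é" as "e" plus the combining acute accent U+0301 (UTF-8: 65 CC 81).
inline std::string decomposed_e_acute() { return "e\xCC\x81"; }

// Sorts with std::string's default ordering, which compares bytes.
inline std::vector<std::string> sort_bytewise(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}
```

The two spellings of "é" compare unequal because their bytes differ, even though they render identically; normalization is what would make them equal. Sorting "zebra", "\xC3\xA9t\xC3\xA9" (the French word for summer) and "apple" puts the accented word last, after "zebra", because the byte 0xC3 is greater than any ASCII letter; a French collation would place it between the other two.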
