How do I properly use std::string on UTF-8 in C++?

Question

My platform is a Mac and C++11 (or above). I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

Recommended Answer

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:


  1. Code Points: Code Points are the basic building blocks of Unicode; a Code Point is just an integer mapped to a meaning. The integer fits into 32 bits (well, 21 bits really, since the largest Code Point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
  2. Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit. Each Code Point is represented as one or several Code Units, depending on its magnitude:


  • UTF-8: 1 to 4 Code Units,
  • UTF-16: 1 or 2 Code Units,
  • UTF-32: 1 Code Unit.
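
To make these counts concrete, here is a minimal sketch that stores the same two Code Points in the three standard string types and looks at size(). The helper name code_units and the sample Code Points (U+54C8 and U+1F600) are chosen purely for illustration; the UTF-8 data is written as explicit bytes so that the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration: size() of any basic_string counts
// Code Units, not Code Points.
template <typename String>
std::size_t code_units(const String& s) {
    return s.size();
}

// U+54C8 (哈), inside the Basic Multilingual Plane.
const std::string    ha_utf8  = "\xE5\x93\x88";  // 3 Code Units of 8 bits
const std::u16string ha_utf16 = u"\u54C8";       // 1 Code Unit of 16 bits
const std::u32string ha_utf32 = U"\u54C8";       // 1 Code Unit of 32 bits

// U+1F600 (a smiley), outside the Basic Multilingual Plane.
const std::string    smiley_utf8  = "\xF0\x9F\x98\x80";  // 4 Code Units
const std::u16string smiley_utf16 = u"\U0001F600";       // 2 Code Units (surrogate pair)
const std::u32string smiley_utf32 = U"\U0001F600";       // 1 Code Unit
```

Calling code_units(ha_utf8) gives 3 while code_units(ha_utf32) gives 1, which is exactly why std::string::size() looks "wrong" when you expect a character count.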


  1. Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
  2. The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
  3. While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.
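
As an illustration of converting at the boundary, the following sketch encodes a std::u32string into UTF-8 bytes held in a std::string, for example just before writing to disk. The function name encode_utf8 is hypothetical, and the sketch assumes every input value is a valid Code Point (no surrogates, nothing above U+10FFFF); a production program would more likely rely on a library such as ICU.

```cpp
#include <string>

// Hypothetical helper: encodes UTF-32 Code Points as UTF-8 bytes.
// Assumes valid input; no error handling is attempted.
std::string encode_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx followed by 3 continuation bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For instance, encode_utf8(U"\u54C8") yields the three bytes E5 93 88.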

If you are only reading or composing strings, you should have few to no issues with std::string or std::wstring.

Troubles start when you start slicing and dicing: you then have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
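
Handling Code Point boundaries in UTF-8 by hand can look like the sketch below. It relies on the fact that every continuation byte has the bit pattern 10xxxxxx. The function names are hypothetical, the input is assumed to be valid UTF-8, and Grapheme Cluster boundaries are deliberately ignored.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts Code Points by counting every byte that is not a continuation byte.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

// Returns the longest prefix of `s` that is at most `max_bytes` long and
// does not end in the middle of a Code Point.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // Back up while the first dropped byte is a continuation byte.
    while (end > 0 && is_continuation(s[end])) --end;
    return s.substr(0, end);
}
```

With the 5-byte string "a哈b", count_code_points returns 3, and truncate_utf8 with a limit of 2 or 3 bytes returns "a" rather than a string ending in a partial 哈.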

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint, though heavy use of Chinese may change the deal, since most Chinese characters take 3 bytes in UTF-8 against 2 bytes in UTF-16. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
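
One possible way to get a std::u32string out of UTF-8 input is sketched below. The function name decode_utf8 is hypothetical, and the code assumes the input is valid UTF-8: it performs no validation of overlong forms, surrogates or truncated sequences, which a real decoder or a library would have to handle.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 Code Points.
// Assumes valid UTF-8; malformed input produces meaningless output.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top bits.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, indexing and size() operate on Code Points: for the 4-byte input "a哈" the result has size 2 and its element at index 1 is U+54C8.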

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:


  • str.find('\n') works,
  • str.find("...") works for matching byte by byte [1],
  • str.find_first_of("\r\n") works if searching for ASCII characters.
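
The difference between find and find_first_of on non-ASCII input can be sketched as follows. The helper names contains and contains_any_byte_of are hypothetical; the two sample characters, 哈 (U+54C8) and 呵 (U+5475), were picked because their UTF-8 encodings share the same first byte.

```cpp
#include <string>

// Explicit UTF-8 bytes keep the example independent of the source encoding.
const std::string ha = "\xE5\x93\x88";  // 哈, U+54C8
const std::string he = "\xE5\x91\xB5";  // 呵, U+5475

// Whole-sequence search: safe for UTF-8, the needle is matched byte by byte.
bool contains(const std::string& text, const std::string& needle) {
    return text.find(needle) != std::string::npos;
}

// find_first_of treats `set` as a set of individual bytes, so it is only
// meaningful when every element of the set is an ASCII character.
bool contains_any_byte_of(const std::string& text, const std::string& set) {
    return text.find_first_of(set) != std::string::npos;
}
```

Here contains(he, ha) is false, as expected, while contains_any_byte_of(he, ha) is true because both encodings start with the byte 0xE5: a false positive. With an ASCII set such as "\r\n" the call behaves as intended. Note also that find returns a byte offset, not a Code Point index.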

Similarly, regex should mostly work out of the box. As a sequence of characters ("哈哈") is just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.

Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional. Use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
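
The repeater pitfall can be reproduced with std::regex over char, which, as far as I know, treats every byte as one regex character. The sketch below is written under that assumption; the helper name matches and the two pattern constants are hypothetical, and 哈 is spelled as its three UTF-8 bytes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// 哈?!   : the '?' applies to the last byte (0x88) only.
const std::string ungrouped = "\xE5\x93\x88?!";
// (哈)?! : the '?' applies to the whole 3-byte sequence.
const std::string grouped   = "(\xE5\x93\x88)?!";
```

Under that assumption, the ungrouped pattern rejects "!" and accepts the invalid byte sequence E5 93 followed by "!", whereas the grouped pattern accepts both "!" and "哈!" and rejects the truncated sequence.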

[1] The key concepts to look up are normalization and collation; these affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
