std::string and UTF-8 encoded Unicode


Question


If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.

  • With char, ASCII characters take a single byte, some Chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.

  • With wchar_t it's the same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3- or 4-byte-wide character will take 2 wchar_t.

Right?

So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer.

If std::string doesn't handle that, second question: are there libraries providing string classes that could handle that UTF-8 encoding so that str[3] actually points to the 4th character (which would be a byte array of length 1 to 4)?

Answer

You are talking about Unicode. Unicode assigns every character a code point in the range U+0000 to U+10FFFF; the simplest encoding, UTF-32, stores each code point in 32 bits. Since that wastes memory, there are more compact encodings. UTF-8 is one of them: it works in byte units and maps each code point to 1, 2, 3 or 4 bytes. UTF-16 is another: it works in 16-bit units and maps each code point to 1 or 2 units (2 or 4 bytes). You can use either encoding with either string or wstring, although UTF-8 is normally kept in a string of char. UTF-8 tends to be more compact for English text and numbers.
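To make the 1-to-4-byte mapping concrete, here is a minimal sketch of a UTF-8 encoder. The helper name `encode_utf8` is made up for this illustration and is not part of the standard library; it assumes the input is a single code point and returns an empty string for values that are not valid Unicode scalar values.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Surrogate code points are reserved for UTF-16 and are not characters.
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(0x41)` ("A") is 1 byte long, `encode_utf8(0x4E2D)` ("中") is 3 bytes, and `encode_utf8(0x1F600)` (an emoji outside the Basic Multilingual Plane) is 4 bytes, which is why `size()` on a std::string counts bytes rather than characters.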

Some things will work regardless of the encoding and type used (compare, for example). However, every function that needs to understand what a character is will be broken: the 5th character is not always the 5th entry in the underlying array. It might look like it is working with certain examples, but it will eventually break.

string::compare will work, but do not expect alphabetical ordering; that is language dependent. On UTF-8 it orders strings byte by byte, which is the same as code point order.

string::find_first_of will work for some inputs but not all. It matches individual bytes, so it is safe when the set you search for is pure ASCII, but a multi-byte character in the set is treated as several unrelated bytes and can match in the middle of a different character. Such mismatches depend on the data, so they may not show up in testing and can produce bugs that are very hard to find.
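The points above can be sketched in a few lines. The helpers below (`is_continuation`, `count_code_points`, `code_point_offset`, `find_first_of_false_positive`) are hypothetical names invented for this illustration, and they assume the string already holds valid UTF-8; a real library would also validate the input.

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in s, assuming s is valid UTF-8:
// every byte that is not a continuation byte starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Byte offset of the n-th code point (0-based), or std::string::npos
// if s has fewer than n + 1 code points. This is a linear scan, unlike s[n].
std::size_t code_point_offset(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        if (seen == n) return i;
        ++seen;
    }
    return std::string::npos;
}

// Shows the find_first_of pitfall: the text does not contain the searched
// character, yet the byte-wise search reports a match.
bool find_first_of_false_positive() {
    const std::string text = "\xE2\x82\xAC";  // U+20AC EURO SIGN
    const std::string set  = "\xE2\x82\xA9";  // U+20A9 WON SIGN
    const bool bytewise_hit = text.find_first_of(set) != std::string::npos;
    const bool substring_absent = text.find(set) == std::string::npos;
    return bytewise_hit && substring_absent;
}
```

For a string made of "a", "é", "中" and "z", `size()` reports 7 bytes while `count_code_points` reports 4, and the character at index 2 starts at byte offset 3. The linear scan in `code_point_offset` is the price of a variable-width encoding: indexing by character cannot be constant time without an extra index.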

The best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
