Working with UTF8
Question
Working with std::string and UTF8 seems to be a rather complicated issue, and I cannot find a good explanation of the dos and don'ts.
How can I properly work with UTF8 in C++? It is rather confusing.
I've found boost::locale and set the global locale:
std::locale::global(boost::locale::generator()(""));
However, after this, what do I need to think about, and when can I run into problems? Will reading from and writing to files work as expected? What about string comparisons, etc.?
So far I'm aware of the following:
- std::regex/boost::regex will not work; I need to convert to wide strings and use wregex.
- boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper.
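To illustrate why byte-oriented tools trip over UTF8, here is a minimal sketch that uses only the standard library. The helper names `bytewise_upper` and `dot_matches_whole` are made up for this example, and `std::toupper` stands in for any algorithm that processes one `char` at a time; it is not the Boost code itself. The sketch assumes the default "C" locale and UTF8-encoded input.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: upper-cases each byte on its own, the way a
// char-by-char algorithm sees a std::string.
std::string bytewise_upper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Hypothetical helper: does the pattern "." (exactly one char) match
// the whole string? With std::regex on std::string, "one char" means
// one byte, not one user-visible character.
bool dot_matches_whole(const std::string& s) {
    static const std::regex one_char(".");
    return std::regex_match(s, one_char);
}
```

With "h\xC3\xA9llo" (that is, "héllo" in UTF8), `bytewise_upper` upper-cases the ASCII letters but leaves the two bytes of "é" untouched, and `dot_matches_whole("\xC3\xA9")` is false because "é" occupies two bytes.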
Other than that, what do I need to be aware of?
Recommended answer
Welcome to the magnificent world of Unicode.
- Sorry, wchar_t is implementation-defined; on Windows it is typically 16 bits wide, which is not enough to hold every full code point (for some Asian scripts, for example).
- You can use comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm. Know, for example, that the order in the German dictionary is different from that in the German phone book (and cry...).
- Generally speaking, I would advise against transforming the strings yourself. Boost.Locale algorithms should work in general, as they wrap ICU, but otherwise refrain from ad-hoc operations.
- If you split a string into several parts, don't split in the middle of a word. It is too easy to split a character in two (even with code-point-aware algorithms, because of diacritics) or, even if you avoid that, to split between two characters that belong together (because some cultures consider certain combinations of adjacent characters as one).
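The points about sorting and splitting can be sketched with the standard library alone. All function names below are hypothetical helpers written for this illustration, and the code assumes its input is valid UTF8 (nothing is validated). It shows the pitfalls rather than a fix; for real collation and text segmentation, a library such as Boost.Locale or ICU is the safer route.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// True if the byte is a UTF8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Counts code points by counting every byte that starts a sequence.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Keeps at most max_code_points code points. It never cuts inside a
// multi-byte sequence, but it can still separate a base letter from a
// combining mark that follows it.
std::string truncate_code_points(const std::string& s,
                                 std::size_t max_code_points) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == max_code_points) return s.substr(0, i);
            ++seen;
        }
    }
    return s;
}

// Sorts by raw byte values, which is what operator< on std::string does.
// Good enough as a key order for look-up, not for presenting to people.
std::vector<std::string> sort_by_bytes(std::vector<std::string> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

For the string "e\xCC\x81" (an "e" followed by the combining acute accent U+0301), `size()` is 3 and `count_code_points` gives 2, yet a reader sees one character; `truncate_code_points` with a limit of 1 returns a bare "e" and drops the accent. Likewise, `sort_by_bytes` places "Zebra" before "apple" and both before "Äpfel", because every non-ASCII character starts with a byte above 0x7F.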