Working with UTF8

Views: 75 Published: 2020/5/3 4:06:45 Tags: c++ string boost locale utf


Question

Working with std::string and UTF8 seems rather complicated, and I cannot find a good explanation of the dos and don'ts.

How can I properly work with UTF8 in C++? It is rather confusing.

I've found boost::locale and set the global locale:

std::locale::global(boost::locale::generator()(""));

However, what do I need to think about after this, and where can I run into problems? Will writing to and reading from files work as expected? What about string comparisons, and so on?

So far I'm aware of the following:

std::regex/boost::regex will not work; I need to convert to wide strings and use wregex.
boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper.
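To illustrate the second point, here is a minimal sketch of what a byte-oriented upper-casing routine does to UTF8 text. It uses only the standard library; `bytewise_upper` is a hypothetical helper written for this illustration, not a Boost function, and it only approximates how a per-byte `to_upper` behaves in the classic "C" locale.

```cpp
#include <locale>
#include <string>

// Hypothetical helper for illustration: upper-cases a string one byte at a
// time using the classic "C" locale. Only the ASCII letters a-z are mapped;
// the bytes of multi-byte UTF8 sequences are left unchanged.
std::string bytewise_upper(std::string s) {
    const std::locale& c_locale = std::locale::classic();
    for (char& ch : s) {
        ch = std::toupper(ch, c_locale);
    }
    return s;
}
```

Calling `bytewise_upper("caf\xC3\xA9")` (the UTF8 encoding of "café") upper-cases the ASCII letters but leaves the two bytes of "é" as they are, so the result is "CAFé" rather than "CAFÉ". A Unicode-aware routine is needed for the last letter.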

Other than that, what do I need to be aware of?

Recommended answer

Welcome to the magnificent world of Unicode.

Sorry, wchar_t is implementation-defined, and on Windows it is typically 16 bits wide, which is not sufficient to hold a full code point for some Asian scripts (for example).
You can use plain comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm. Know, for example, that the order in the German dictionary is different from that in the German phone book (and cry...).
Generally speaking, I would advise against transforming the strings yourself. Boost.Locale algorithms should work in general, as they wrap ICU, but otherwise refrain from ad-hoc operations.
If you split the string into several parts, don't split in the middle of words. It is too easy either to split a character in two (even with code-point-aware algorithms, because of diacritics) or, even if you avoid that, to split between two characters that belong together (because some cultures consider certain combinations of adjacent characters as one).
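The points above about code units, code points, and visible characters can be sketched with the standard library alone. The names `count_code_points`, `e_acute`, and `cjk_ext_b` are hypothetical and introduced only for this illustration; the counting function assumes its input is already valid UTF8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to be valid UTF8:
// every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded as UTF8.
// A reader sees a single character.
const std::string e_acute = "e\xCC\x81";

// U+20000, a CJK ideograph outside the Basic Multilingual Plane,
// stored as UTF-16 code units (the width wchar_t typically has on Windows).
const std::u16string cjk_ext_b = u"\U00020000";
```

Here `e_acute.size()` is 3 (bytes) and `count_code_points(e_acute)` is 2, although only one character is displayed, so cutting the string after the first code point would detach the accent. `cjk_ext_b.size()` is 2, because a single code point above U+FFFF needs two 16-bit units. Plain `std::string` comparison is byte-wise: `std::string("z") < std::string("\xC3\xA4")` holds, so "ä" sorts after "z", which is fine for look-up but is not the order a German reader would expect.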
