Working with UTF8

Question

Working with std::string and UTF8 seems like a rather complicated issue, and I cannot find a good explanation of the dos and don'ts.

How can I properly work with UTF8 in C++? It is rather confusing.

I've found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

However, after this, what do I need to think about, and where can I run into problems? Will writing to and reading from files work as expected? What about string comparisons, and so on?

So far I'm aware of the following:

  • std::regex/boost::regex will not work; I need to convert to wide strings and use wregex.
  • boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper.
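
One reason byte-oriented routines stumble over UTF8 is that std::string stores bytes, not characters, so a single non-ASCII character occupies several elements. The sketch below illustrates the difference; `count_code_points` is a hypothetical helper (not part of the standard library or Boost), and it assumes the input is already valid UTF8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx, so every
// byte that does not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF8 encoding of "Grüße" (written as `std::string("Gr\xC3\xBC\xC3\x9F") + "e"` to keep the source file ASCII-only), `size()` reports 7 bytes while this helper reports 5 code points. Any algorithm that treats each `char` as a character, such as a per-byte upper-casing loop, will therefore see the pieces of "ü" and "ß" rather than the letters themselves.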

Other than that what do I need to be aware of?

Recommended answer

Welcome to the magnificent world of Unicode.

  1. Sorry, wchar_t is implementation-defined, and on Windows it is typically 16 bits wide, which is not sufficient to hold every full code point (for example, some characters of Asian scripts lie outside the Basic Multilingual Plane).
  2. You can use comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm. Know, for example, that the order in the German dictionary is different from that in the German phone book (and cry...).
  3. Generally speaking, I would advise against transforming the strings yourself. Boost.Locale algorithms should generally work, since they wrap ICU, but otherwise refrain from ad-hoc operations.
  4. If you split the string into several parts, don't split in the middle of words. It is too easy to split a character in two (even with code-point-aware algorithms, because of diacritics) or, even when avoiding that, to split between two characters that belong together (because some cultures treat certain combinations of adjacent characters as one).
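
To make point 4 concrete, here is a minimal sketch of a code-point-aware cut, assuming valid UTF8 input. `utf8_prefix_length` is a hypothetical name introduced for illustration only. It avoids cutting inside a multi-byte sequence, but as the answer warns, it knows nothing about combining marks, so it can still separate a letter from its diacritic.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the largest prefix length <= max_bytes that
// does not end inside a multi-byte UTF-8 sequence. Assumes valid UTF-8.
// It is code-point aware only; it does NOT respect combining characters.
std::size_t utf8_prefix_length(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) {
        return s.size();
    }
    std::size_t pos = max_bytes;
    // If the first byte after the cut is a continuation byte (10xxxxxx),
    // the cut falls inside a sequence, so back up to its lead byte.
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

With "aé" followed by "b" (bytes `a`, `0xC3`, `0xA9`, `b`), asking for at most 2 bytes yields a prefix of 1 byte, because cutting at 2 would split "é". However, for "e" followed by the combining acute accent U+0301 (bytes `e`, `0xCC`, `0x81`), asking for at most 1 byte also yields 1: a valid code-point boundary that nevertheless strips the accent from the letter. Splitting on word boundaries with a library that implements Unicode text segmentation avoids this class of problem.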