The proper way to handle Unicode with C++ in 2018?


Question

I have tried searching Stack Overflow for an answer to this, but the questions and answers I've found are around 10 years old, and given the changes and possible progress since then I can't find any consensus on the subject.

There are several libraries that I know of outside of the STL that are supposed to handle Unicode.

A few Unicode-related features were included in the STL (wstring, codecvt_utf8), but people seem ambivalent about using them because they deal with UTF-16, which the UTF-8 Everywhere site says shouldn't be used, and many people online seem to agree with that premise.

The only thing I'm looking for is the ability to do 4 things with a Unicode string:

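  1. Read a string into memory.
  2. Search the string with a regex using Unicode or ASCII, concatenate it, or do text replacement/formatting on it with either ASCII + Unicode numbers or characters.
  3. Convert characters that don't fit in the ASCII range to an ASCII + Unicode number format.
  4. Write the string to disk or send it wherever.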
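
Item 3 can be sketched in plain C++ without any library, as long as the input is assumed to be UTF-8. The helper names `decode_utf8` and `escape_non_ascii` below are hypothetical, and the `\uXXXX` / `\UXXXXXXXX` output format is only one guess at what "ASCII + Unicode number" means. The decoder is deliberately simplified: it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[i], advances i
// past the bytes it consumed, and returns the code point.
// Malformed input yields U+FFFD. Simplified: overlong forms and surrogates
// are not rejected.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());  // precondition: there is at least one byte to read
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // plain ASCII

    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte

    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD;           // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;      // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Hypothetical helper: keeps ASCII as-is and replaces every other code point
// with \uXXXX (BMP) or \UXXXXXXXX (beyond the BMP).
std::string escape_non_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const char32_t cp = decode_utf8(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
            continue;
        }
        char buf[16];
        if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

Calling `escape_non_ascii` on the UTF-8 bytes of "café" would give `caf\u00E9`. Reading and writing (items 1 and 4) then only involve bytes, so `std::ifstream` and `std::ofstream` in binary mode are enough for them; regex search over non-ASCII text (item 2) is where a library such as ICU becomes hard to avoid.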

From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.

Thanks for your time.

Recommended answer

I will try to throw out some ideas here:

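  • most C++ programs and programmers simply assume that text is an almost opaque sequence of bytes. UTF-8 is probably to blame for that, and it is no surprise that many comments come down to: don't worry about Unicode, just process UTF-8 encoded strings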

  • files only contain bytes. At some point, if you try to process true Unicode code points internally, you will have to serialize them to bytes, and here again UTF-8 wins the point
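
As a rough illustration of that serialization step, here is a minimal sketch that turns a sequence of code points (`std::u32string`) into UTF-8 bytes by hand. The function names `append_utf8` and `to_utf8` are made up for this example, and replacing invalid values with U+FFFD is just one possible policy.

```cpp
#include <cassert>
#include <string>

// True for values that UTF-8 may encode: everything up to U+10FFFF
// except the surrogate range U+D800..U+DFFF.
bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: appends the UTF-8 encoding (1 to 4 bytes) of cp to out.
// Precondition: cp is a Unicode scalar value.
void append_utf8(std::string& out, char32_t cp) {
    assert(is_scalar_value(cp));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: serializes code points to UTF-8 bytes, substituting
// U+FFFD (assumed policy) for anything that is not a scalar value.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        append_utf8(out, is_scalar_value(cp) ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

The resulting `std::string` can be written to a file opened in binary mode as-is, which is the sense in which UTF-8 "wins the point": the on-disk form and a perfectly usable in-memory form are the same bytes.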

  • as soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are particularly awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display style, more or less like the old i, backspace, ^ sequence that was used in 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.

That simply means that, in these cases, up to 3 consecutive Unicode code points can represent one single glyph. So the idea that one character is one char32_t is still an approximation.
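
To make that concrete, here is a small sketch showing that the length of a `std::u32string` is a count of code points, not of glyphs. The function `approx_glyph_count` is a hypothetical, deliberately naive helper: it only skips the two kinds of trailing code points mentioned above (variation selectors and skin tone modifiers) and is not real grapheme segmentation, which is what a library like ICU provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+FE00..U+FE0F are VARIATION SELECTOR-1 through VARIATION SELECTOR-16.
bool is_variation_selector(char32_t cp) {
    return cp >= 0xFE00 && cp <= 0xFE0F;
}

// U+1F3FB..U+1F3FF are the five emoji skin tone modifiers.
bool is_skin_tone_modifier(char32_t cp) {
    return cp >= 0x1F3FB && cp <= 0x1F3FF;
}

// Hypothetical, naive estimate of how many glyphs a reader would see:
// counts every code point except variation selectors and skin tone
// modifiers, which are assumed to attach to the preceding code point.
// Combining marks, ZWJ sequences, flags, etc. are NOT handled.
std::size_t approx_glyph_count(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t cp : s) {
        if (!is_variation_selector(cp) && !is_skin_tone_modifier(cp)) {
            ++count;
        }
    }
    return count;
}
```

For example, `U"\U0001F44D\U0001F3FB"` (thumbs up followed by a skin tone modifier) has `size() == 2` but is drawn as one glyph, and `approx_glyph_count` returns 1 for it. Anything more general than these two cases needs the segmentation rules from the Unicode standard rather than a hand-written filter like this one.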

My conclusion is that Unicode is a complex thing and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP, but full support goes far beyond that.

BTW: even other languages like Python that claim to have native Unicode support (which is IMHO far better than the current C++ one) often fail in some areas:

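  • the tkinter GUI library, which the standard IDLE Python tool is built on, cannot display any code point outside the BMP
  • in addition to the core language support (codecs and unicodedata), other modules of the standard library are dedicated to Unicode, and further modules are available in the Python Package Index (emoji support, for example) because the standard library does not meet all needs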

So support for Unicode has been poor for more than 10 years, and I do not really expect things to get much better in the next 10 years...
