Cross-platform strings (and Unicode) in C++

Question

So I've finally gotten back to my main task - porting a rather large C++ project from Windows to the Mac.

Straight away I've been hit by the problem that wchar_t is 16 bits on Windows but 32 bits on the Mac. This is a problem because all of the strings are represented by wchar_t, and there will be string data going back and forth between Windows and Mac machines (in both on-disk and network forms). Because of the way the code works, it wouldn't be totally straightforward to convert the strings into some common format before sending and receiving the data.

We've also started to support a lot more languages recently, so we're starting to deal with a lot of Unicode data (as well as right-to-left languages).

Now, I could be conflating multiple ideas here and causing more problems for myself than needed, which is why I'm asking this question. We're thinking that storing all of our in-memory string data as UTF-8 makes a lot of sense. It solves the problem of wchar_t being different sizes, it means we can easily support multiple languages, and it also dramatically reduces our memory footprint (we have a LOT of - mostly English - strings loaded) - but it doesn't seem like many people are doing this. Is there something we're missing? There's the obvious problem you have to deal with, where the string length (in characters) can be less than the size in bytes of the memory storing that string data.
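To make the length concern concrete, here is a minimal sketch of counting code points in a UTF-8 string. The function name `utf8_code_points` is made up for illustration, and the code assumes the input is already valid UTF-8. Note that code points are still not the same thing as user-perceived characters, which can span several code points.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Assumption: the input is valid UTF-8 (no validation is done here).
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string such as "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), `size()` reports 6 bytes while this function reports 5 code points.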

Or is using UTF-16 a better idea? Or should we stick to wchar_t and write code to convert between wchar_t and, say, UTF-8 in the places where we read from or write to the disk or the network?
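If the wchar_t route were taken, the conversion at the I/O boundary might look roughly like the sketch below. The helper names `append_utf8` and `wide_to_utf8` are hypothetical, not from any library. The sketch assumes wchar_t holds UTF-16 code units where it is 16 bits wide and UTF-32 code points where it is 32 bits wide, and it replaces unpaired surrogates and out-of-range values with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one code point to `out` as 1 to 4 UTF-8 bytes.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts a wide string to UTF-8 regardless of
// whether wchar_t is 16 or 32 bits on the current platform.
std::string wide_to_utf8(const std::wstring& w) {
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(w[i]);
        // A high surrogate followed by a low surrogate encodes one code
        // point above U+FFFF (expected only where wchar_t is 16 bits).
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // consume the low surrogate as well
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The same source then produces identical bytes on both platforms, which is the property the on-disk and network formats need. A matching UTF-8 to wide conversion would be required for reading.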

I realize this is dangerously close to asking for opinions - but we're nervous that we're overlooking something obvious, because it doesn't seem like there are many Unicode string classes (for example), yet there's plenty of code for converting to/from Unicode, as in boost::locale, iconv, utf-cpp and ICU.

Recommended answer

Always use a protocol defined down to the byte when a file or network connection is involved. Do not rely on how a C++ compiler stores anything in memory. For Unicode text, this means choosing both an encoding and a byte order (okay, UTF-8 doesn't care about byte order). Even if the platforms you currently want to support have similar architectures, another popular platform with different behavior, or even a new OS for one of your existing platforms, will likely come along, and you'll be glad you wrote portable code.
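As one possible illustration of "defined down to the byte", the sketch below frames a UTF-8 string as a 4-byte big-endian length followed by the raw bytes. The wire format and the names `encode_string` and `decode_string` are assumptions made up for this example, not something the answer prescribes. The point is only that every byte is written explicitly instead of copying an in-memory integer or wchar_t buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wire format: 4-byte big-endian byte count, then UTF-8 bytes.
std::vector<std::uint8_t> encode_string(const std::string& utf8) {
    if (utf8.size() > 0xFFFFFFFFu) {
        throw std::length_error("string too long for 32-bit length prefix");
    }
    const std::uint32_t n = static_cast<std::uint32_t>(utf8.size());
    std::vector<std::uint8_t> out;
    out.reserve(4 + utf8.size());
    // Write the length one byte at a time, most significant byte first,
    // so the result does not depend on the host's endianness.
    out.push_back(static_cast<std::uint8_t>(n >> 24));
    out.push_back(static_cast<std::uint8_t>(n >> 16));
    out.push_back(static_cast<std::uint8_t>(n >> 8));
    out.push_back(static_cast<std::uint8_t>(n));
    for (char c : utf8) {
        out.push_back(static_cast<std::uint8_t>(c));
    }
    return out;
}

// Reverses encode_string; throws if the buffer is shorter than declared.
std::string decode_string(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 4) {
        throw std::runtime_error("truncated length prefix");
    }
    const std::uint32_t n = (static_cast<std::uint32_t>(buf[0]) << 24) |
                            (static_cast<std::uint32_t>(buf[1]) << 16) |
                            (static_cast<std::uint32_t>(buf[2]) << 8) |
                            static_cast<std::uint32_t>(buf[3]);
    if (buf.size() - 4 < n) {
        throw std::runtime_error("truncated payload");
    }
    std::string out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out += static_cast<char>(buf[4 + i]);
    }
    return out;
}
```

Because the length is assembled with shifts rather than a memcpy of the integer, the same bytes are produced and accepted on little-endian and big-endian hosts alike.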
