Cross-platform C++: Use the native string encoding or standardise across platforms?


Question



We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type on Windows is UTF-16, and UTF-8 on Linux.

We can't decide which is the best approach:

1. Standardise on one of the two in all our application logic (and persistent data), and make the other platform do the appropriate conversions.

2. Use the natural format of the OS for application logic (and thus for calls into the OS), and convert only at the point of IPC and persistence.

To me they seem about as good as each other.

Solution

"and UTF-8 in linux."

That is mostly true for modern Linux. The actual encoding depends on which API or library is used: some are hardcoded to use UTF-8, but others read the LC_ALL, LC_CTYPE or LANG environment variables to detect which encoding to use (the Qt library, for example). So be careful.
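That lookup can be observed directly. Below is a minimal POSIX-only sketch (the helper name locale_codeset is mine, for illustration) that asks the C locale machinery which codeset the environment selects - the same LC_ALL / LC_CTYPE / LANG chain that locale-aware libraries honour:

```cpp
#include <clocale>     // std::setlocale
#include <langinfo.h>  // nl_langinfo, CODESET (POSIX only, not Windows)
#include <string>

// Return the codeset of the user's locale, e.g. "UTF-8" or "ISO-8859-1".
std::string locale_codeset() {
    // An empty locale name means "read the environment":
    // LC_ALL first, then LC_CTYPE, then LANG.
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

On a typical modern desktop this returns "UTF-8", but in a stripped-down environment (cron jobs, containers with the "C" locale) it can be plain ASCII - which is exactly why hardcoding an assumption is risky.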

"We can't decide whether the best approach"

As usual, it depends.

If 90% of your code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.

If 90% of your code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.

In the second case you have a choice:

• Use a cross-platform library that provides string support (Qt or ICU, for example)
• Use bare pointers (I consider std::string a "bare pointer" too)

If working with strings is a significant part of your application, choosing a good string library is a smart move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.

When using a library for strings, you need to care about encoding only when working with external libraries, platform APIs, or when sending strings over the network (or to disk). For example, many Cocoa, C# or Qt programmers (all three environments have solid string support) know very little about encoding details - and that is fine, since they can focus on their main task.

My experience in working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (but one probably needs some experience and a Unicode background to appreciate that).

I agree that the bare-pointer approach is not for everyone. It is good when:

• You work with entire strings, and splitting, searching and comparing are rare tasks
• You can use the same encoding in all components and need a conversion only when using platform APIs
• All your supported platforms have APIs to:
  • Convert from your encoding to the one used in the API
  • Convert from the API encoding to the one used in your code
• Pointers are not a problem for your team
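The conversion APIs in question are platform-specific (on Windows, MultiByteToWideChar/WideCharToMultiByte; elsewhere, iconv or a library). As a rough illustration of what happens at that boundary, here is a hand-rolled UTF-8 to UTF-16 sketch - it assumes well-formed input and does no error handling, so it is a teaching aid, not production code:

```cpp
#include <cstdint>
#include <string>

// Decode UTF-8 byte sequences into code points, then re-encode the code
// points as UTF-16 units (one unit for the BMP, a surrogate pair above it).
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        std::uint32_t cp;
        unsigned char b = in[i];
        if (b < 0x80)      { cp = b;                                  i += 1; }
        else if (b < 0xE0) { cp = (b & 0x1F) << 6  | (in[i+1] & 0x3F);            i += 2; }
        else if (b < 0xF0) { cp = (b & 0x0F) << 12 | (in[i+1] & 0x3F) << 6
                                | (in[i+2] & 0x3F);                               i += 3; }
        else               { cp = (b & 0x07) << 18 | (in[i+1] & 0x3F) << 12
                                | (in[i+2] & 0x3F) << 6 | (in[i+3] & 0x3F);       i += 4; }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));   // BMP: one UTF-16 unit
        } else {                                        // above BMP: surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note that U+1F600 (an emoji) comes out as two UTF-16 units - the "UTF-16 is not always 2 bytes" case mentioned later in this answer.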

From my admittedly specific experience, that is actually a very common situation.

When working with bare pointers, it is good to choose one encoding that will be used across the entire project (or across all projects).

From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a string library or the platform API for strings - it will save you a lot of time.

Advantages of UTF-8:

• Fully ASCII-compatible. Any ASCII string is a valid UTF-8 string.
• The C standard library works great with UTF-8 strings. (*)
• The C++ standard library works great with UTF-8 (std::string and friends). (*)
• Legacy code works great with UTF-8.
• Practically every platform supports UTF-8.
• Debugging is MUCH easier with UTF-8 (since it is ASCII-compatible).
• No little-endian/big-endian mess.
• You will not hit the classic bug: "Oh, UTF-16 is not always 2 bytes?"

(*) Until you need to compare them lexically, transform case (toUpper/toLower), change the normalization form, or something similar - if you do, use a string library or the platform API.
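The ASCII-compatibility advantage deserves a concrete illustration. Every non-ASCII byte in UTF-8 is 0x80 or above, so a byte can never be mistaken for '.', '/', '\n' and friends, and byte-oriented std::string operations stay correct whenever the pattern searched for is pure ASCII. A hypothetical helper as a sketch:

```cpp
#include <cstddef>
#include <string>

// Extract the extension from a UTF-8 path. The byte-wise rfind is safe
// because '.' (0x2E) cannot occur inside a multi-byte UTF-8 sequence.
std::string path_extension(const std::string& utf8_path) {
    std::size_t dot = utf8_path.rfind('.');
    return dot == std::string::npos ? std::string() : utf8_path.substr(dot + 1);
}
```

The same function would be wrong on a UTF-16 buffer treated as raw bytes, since half of a UTF-16 unit can collide with an ASCII value.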

The disadvantages are questionable:

• Less compact for Chinese text (and other symbols with large code-point numbers) than UTF-16.
• A little harder to iterate over symbols.
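To make the second point concrete, here is the classic code-point counting loop for UTF-8 (a sketch that assumes valid input; real code would also validate it). In UTF-8 every continuation byte has the form 10xxxxxx, so a new code point starts exactly at every byte that does not match that pattern:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)  // not a continuation byte -> new code point
            ++n;
    return n;
}
```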

So, I recommend using UTF-8 as the common encoding for any project that doesn't use a string library.

But encoding is not the only question you need to answer.

There is also such a thing as normalization. To put it simply, some letters can be represented in several ways - as one glyph, or as a combination of different glyphs. The common problem is that most string-comparison functions treat these representations as different symbols. If you are working on a cross-platform project, choosing one of the normalization forms as your standard is the right move. It will save you time.

For example, if a user's password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered under Windows with such a password, they will have trouble logging in on a Mac.
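A minimal demonstration of that mismatch, using "é" instead of the Cyrillic example (the byte values follow from the Unicode code charts):

```cpp
#include <string>

// The same visible letter "é" in the two common normalization forms:
//   NFC: one precomposed code point U+00E9        -> UTF-8 bytes C3 A9
//   NFD: 'e' plus combining acute accent U+0301   -> UTF-8 bytes 65 CC 81
const std::string e_nfc = "\xC3\xA9";
const std::string e_nfd = "e\xCC\x81";
// A byte-wise (or even code-point-wise) comparison sees two different
// strings, although the user sees the same character on screen.
```

This is why a password hashed from NFD input will never match one hashed from NFC input - normalize before comparing or hashing.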

In addition, I would not recommend using wchar_t (or I would use it only in Windows code, as a UCS-2/UTF-16 character type). The problem with wchar_t is that there is no encoding associated with it. It is just an abstract wide character that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
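A small sketch of the portable alternative: C++11 introduced character types with a guaranteed-width code unit, which sidestep the wchar_t ambiguity entirely:

```cpp
// wchar_t carries no encoding and no fixed width (16 bits on Windows,
// 32 bits on most *nix). When you need code units of a known width, the
// C++11 types char16_t and char32_t are the portable choice.
static_assert(sizeof(char16_t) == 2, "char16_t is a 16-bit code unit");
static_assert(sizeof(char32_t) == 4, "char32_t is a 32-bit code unit");
// sizeof(wchar_t) is deliberately not asserted here: it varies by platform.
```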
