Internal and external encoding vs. Unicode


Question


Since several posters spread a lot of misinformation in the comments on this question: C++ ABI issues list

I have created this one to clarify.

  1. What are the encodings used for C style strings?
  2. Is Linux using UTF-8 to encode strings?
  3. How does external encoding relate to the encoding used by narrow and wide strings?

Solution

  1. Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.

  2. In what sense? Most of the OS ignores the encoding for the most part; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these are among the first 128 code points of Unicode, so they have a single byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you do ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.

  3. That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise? The encoding of a string literal must be determined at compile time, whereas the internal encoding of a narrow character string depends on the locale used to read it, and can vary from one string to the next.)
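To illustrate point 2, here is a minimal sketch of why code that only cares about '/' can treat filenames as opaque bytes. The function `split_path` and the two sample names are made up for this example; they are not part of any system API. The names are written with hex escapes so that the bytes do not depend on how the source file itself is encoded.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a path on the byte '/' (0x2F) without
// knowing or caring how the individual names are encoded.
std::vector<std::string> split_path(const std::string& path)
{
    std::vector<std::string> parts;
    std::string current;
    for (char c : path) {
        if (c == '/') {
            // Skip empty components produced by leading or doubled slashes.
            if (!current.empty()) {
                parts.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        parts.push_back(current);
    }
    return parts;
}

// The same visible name, "é.txt", as raw bytes in two encodings.
const std::string name_utf8   = "\xC3\xA9.txt";  // U+00E9 as two bytes in UTF-8
const std::string name_latin1 = "\xE9.txt";      // U+00E9 as one byte in ISO 8859-1
```

Calling `split_path("/home/" + name_utf8)` and `split_path("/home/" + name_latin1)` both yield two components, and the second component comes back byte for byte as it went in. This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for '/'. Which of the two names displays correctly is a separate matter that depends on the terminal.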

If you're developing a new application on Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings, and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
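As a rough sketch of what that recommendation looks like in code, the function below decodes a UTF-8 narrow string into UTF-32. The name `utf8_to_utf32` is an assumption made for this example, not a standard library function, and the decoder is deliberately minimal: it rejects truncated sequences and bad continuation bytes, but does not check for overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80) {                  // 0xxxxxxx: one of the first 128 code points
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: two-byte sequence
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: three-byte sequence
            cp = lead & 0x0F;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) { // 11110xxx: four-byte sequence
            cp = lead & 0x07;
            extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }

        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char byte = static_cast<unsigned char>(in[i + k]);
            if ((byte & 0xC0) != 0x80) {    // continuation bytes are 10xxxxxx
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (byte & 0x3F);
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In line with the caveat about string literals, test input for this function is best written with hex escapes, for example `utf8_to_utf32("a\xC3\xA9")`, rather than by typing the accented character directly into the source file.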
