Using Unicode in C++ source code

Question

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

std::wstring str = L"Strange chars: â Țđ ě €€";


Recommended answer

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set. These are the common characters listed in §2.2/1 (§2.3/1 in C++11), and each of them should fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. These look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).

This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This mapping constitutes the encoding used. Here is what the standard says literally (C++98 version):


Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

For gcc, you can change the source file encoding using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The options for this are -fexec-charset=charset for char (it defaults to utf-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either utf-16 or utf-32, depending on the size of wchar_t).
