Unicode Identifiers and Source Code in C++11?

Question

I find in the new C++ Standard:

2.11 Identifiers                  [lex.name]
identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit
identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined character

with the additional text:

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. [...]

I cannot quite comprehend what this means. From the old standard I am used to a "universal character name" being written as \u89ab, for example. But using those in an identifier...? Really?

Is the new standard more open with respect to Unicode? I am not referring to the new literal types such as "uHello \u89ab thing"u32; I think I understood those. But:

  • Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16, or any (however defined) codepage?
  • Can I write an identifier with \u1234 in it, such as myfu\u1234ntion (for whatever purpose)?
  • Or can I use the "character names" that Unicode defines, as in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

or even in an identifier in the source itself? That would be a treat... cough...

I think the answer to all these questions is no, but I cannot map this reliably to the wording in the standard... :-)

I found "2.2 Phases of translation [lex.phases]", Phase 1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

Reading this, I now think that a compiler may choose to accept UTF-8, UTF-16, or any codepage it wishes (selected by meta information or user configuration). In Phase 1 it translates the input into an ASCII form (the "basic source character set") in which each Unicode character is replaced by its \uNNNN notation. Alternatively, the compiler can keep working in its Unicode representation, but then it has to make sure that it handles any \uNNNN written in the source the same way.
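As a rough illustration of that reading of Phase 1, here is a toy sketch of the conceptual mapping. The function name `to_ucn_form` is hypothetical, and the code simplifies by passing every ASCII code point through unchanged (the real basic source character set is slightly smaller than ASCII). It is not how any particular compiler is implemented.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrite already-decoded source text so that every
// non-ASCII code point is spelled as a universal-character-name.
// Assumption: all of ASCII is treated as "basic", which is a simplification.
std::string to_ucn_form(const std::u32string& source) {
    std::string out;
    for (char32_t cp : source) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);           // basic character: keep as-is
        } else {
            char buf[11];                           // "\UXXXXXXXX" plus terminator
            if (cp <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;                             // extended character: emit a UCN
        }
    }
    return out;
}
```

Under these assumptions, passing the decoded text of `long pörk;` yields the characters `long p\u00F6rk;`, which is the form the standard describes for the later translation phases.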

What do you think?

Recommended answer

Is the new standard more open with respect to Unicode?

With respect to allowing universal character names (UCNs) in identifiers, the answer is no: UCNs were already allowed in identifiers in C99 and C++98. However, compilers did not implement that particular requirement until recently. Clang 3.3, I think, introduced support for it, and GCC has had an experimental feature for it for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would be coming to VC++ at some point. (Although, IIRC, Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)

Identifiers are not expected to be written using UCNs. Instead, the expected practice is to write the desired character directly in the source encoding. E.g., source will look like:

long pörk;

Not:

long p\u00F6rk;

However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme in which at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful, for example, when writing the string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for the basic characters.
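A minimal sketch of the same idea follows. It uses char32_t and char16_t literals rather than a plain char literal, because the bytes of a plain char literal depend on the execution character set, while the UTF-32 and UTF-16 values do not. The names `degree_utf32` and `temperature_label` are made up for illustration.

```cpp
#include <string>

// The source file contains only basic characters, yet the literal
// names the extended character U+00B0 (DEGREE SIGN).
constexpr char32_t degree_utf32 = U'\u00b0';
static_assert(degree_utf32 == 0x00B0, "the UCN designates code point U+00B0");

// In a UTF-16 literal the same UCN is one code unit, plus the terminator.
static_assert(sizeof(u"\u00b0") == 2 * sizeof(char16_t),
              "one UTF-16 code unit plus the null terminator");

// Hypothetical helper returning a label that embeds the degree sign.
std::u32string temperature_label() {
    return U"21 \u00b0C";
}
```

Because the checks are `static_assert`s, a compiler that mishandled the UCN would reject the file instead of silently producing a different character.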

Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16, or any (however defined) codepage?

It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), GCC allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)

Can I write an identifier with \u1234 in it, such as myfu\u1234ntion (for whatever purpose)?

Yes, the specification mandates this, although, as I said, not all compilers implement this requirement yet.

Or can I use the "character names" that Unicode defines, as in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

No, you cannot use Unicode long names.

or even in an identifier in the source itself? That would be a treat... cough...

If the compiler supports a source code encoding that contains the extended character you want, then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec, then you may write any character in its source character set directly in the source without bothering to write UCNs.
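The equivalence can be sketched as follows, assuming the file is saved as UTF-8 and the compiler accepts both extended characters and UCNs in identifiers (which, as noted above, not every compiler version does). The function name `read_with_ucn` is made up for illustration.

```cpp
// Declared with the extended character written directly in the source.
long pörk = 42;

// Referenced through the equivalent universal-character-name.
// Both spellings must name the same variable.
long read_with_ucn() {
    return p\u00F6rk;
}
```

If the compiler implements the requirement, this compiles and `read_with_ucn()` returns the value stored in the variable declared with the literal character. A compiler lacking the feature will typically reject one of the two spellings.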
