Unicode Identifiers and Source Code in C++11?

Question

In the new C++ standard I find:

2.11 Identifiers                  [lex.name]
identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit
identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined character

with the additional text:


An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. [...]

I cannot quite comprehend what this means. From the old standard I am used to a "universal character name" being written as, for example, \u89ab. But using those in an identifier...? Really?

Is the new standard more open with respect to Unicode? I am not referring to the new literal types such as u"Hello \u89ab thing"; I think I understood those. But:


  • Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16, or any (however defined) codepage?
  • Can I write an identifier with \u1234 in it, such as myfu\u1234ntion (for whatever purpose)?
  • Or can I use the "character names" that Unicode defines, as in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

or even in an identifier in the source itself? That would be a treat... cough...

I think the answer to all these questions is no, but I cannot reliably map this to the wording in the standard... :-)

Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:


Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

Reading this, I now think that a compiler may choose to accept UTF-8, UTF-16, or any codepage it wishes (via meta-information or user configuration). In Phase 1 it translates this into an ASCII form (the "basic source character set"), in which the Unicode characters are replaced by their \uNNNN notation. Alternatively, the compiler can choose to keep working in its Unicode representation, but then it has to make sure it handles any other \uNNNN the same way.

What do you think?

Recommended answer


Is the new standard more open with respect to Unicode?

With respect to allowing universal character names (UCNs) in identifiers, the answer is no: UCNs were already allowed in identifiers back in C99 and C++98. However, compilers did not implement that particular requirement until recently. Clang 3.3, I think, introduced support for this, and GCC has had an experimental feature for it for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would be coming to VC++ at some point. (Although, IIRC, Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)

It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:

long pörk;

Not:

long p\u00F6rk;

However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme in which at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
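
As a rough illustration of that point, the sketch below contrasts a narrow literal, whose bytes depend on the execution character set, with a C++11 char32_t literal, whose value is the code point itself. The function names are hypothetical and not part of the original answer; a C++11 compiler is assumed.

```cpp
#include <cstddef>
#include <string>

// Narrow literal: the bytes depend on the compiler's execution character set.
// Typically two bytes (0xC2 0xB0) under UTF-8 and one byte (0xB0) under CP1252.
inline std::size_t degree_sign_narrow_length() {
    return std::char_traits<char>::length("\u00b0");
}

// char32_t literal (C++11): the value is the code point U+00B0,
// whatever encoding the source file was saved in.
inline char32_t degree_sign_code_point() {
    return U'\u00b0';
}
```

Because the source uses only basic characters, the same file can be handed to compilers that expect different source encodings; only the narrow length may differ between them.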


Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16, or any (however defined) codepage?

It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)


Can I write an identifier with \u1234 in it, such as myfu\u1234ntion (for whatever purpose)?

Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.
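
A minimal sketch of what this looks like, assuming a compiler that implements UCNs in identifiers (recent GCC and Clang releases should). The variable and its value are made up for illustration; the function name is the asker's example.

```cpp
// Identifier spelled with a UCN: \u00F6 names the character U+00F6 ('ö'),
// so this declares the same variable as the literal spelling would.
long p\u00F6rk = 42;

// The asker's example name. U+1234 falls in a range that the standard
// allows in identifiers.
long myfu\u1234ntion(long x) {
    return x + 1;
}
```

Calling myfu\u1234ntion(p\u00F6rk) yields 43. The file itself contains only basic source characters, so it does not depend on the compiler's source encoding.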


Or can I use the "character names" that Unicode defines, as in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;


No, you cannot use Unicode long names.


or even in an identifier in the source itself? That would be a treat... cough...

If the compiler supports a source code encoding that contains the extended character you want then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec then you may write any character in its source character set directly in the source without bothering with writing UCNs.
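
To make that equivalence concrete, here is a conceptual sketch of the phase 1 replacement: every extended character in already-decoded source text is rewritten as a UCN. The helper to_ucn_form is hypothetical, and treating everything below 0x80 as "basic" is a simplification (the basic source character set is a subset of ASCII). Real compilers are free to use any internal representation instead.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrite each non-ASCII code point as a UCN,
// \uXXXX for code points up to U+FFFF and \UXXXXXXXX above that.
std::string to_ucn_form(const std::u32string& source) {
    std::string out;
    for (char32_t cp : source) {
        if (cp < 0x80) {
            // Simplification: pass ASCII through unchanged.
            out += static_cast<char>(cp);
        } else {
            char buf[11];  // backslash + 'U' + 8 hex digits + terminator
            if (cp <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            } else {
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            }
            out += buf;
        }
    }
    return out;
}
```

Given the decoded text of "long pörk;", this helper produces the UCN spelling shown earlier, which is why the two forms have to name the same identifier.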
