Why does stringizing a euro sign within a string literal using UTF-8 not produce a UCN?


Problem description



The spec says that at phase 1 of compilation

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

And at phase 4 it says

Preprocessing directives are executed, macro invocations are expanded

At phase 5, we have

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set

For the # operator, we have

a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).

Hence I conducted the following test

#define GET_UCN(X) #X
GET_UCN("€")

With an input character set of UTF-8 (matching my file's encoding), I expected the following preprocessing result of the #X operation: "\"\\u20AC\"". GCC, Clang and boost.wave don't transform the € into a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you please explain?
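
For reference, a minimal narrow-string check (a sketch, assuming the source file is saved as UTF-8 and reusing the GET_UCN macro above) makes the observed result visible at run time:

#include <cstdio>

#define GET_UCN(X) #X

int main() {
    // Expected per phases 1 and 4: the six characters \u20AC appear literally
    // between the quotation marks.
    // Observed with GCC and Clang: the raw UTF-8 euro sign is kept instead.
    std::puts(GET_UCN("€"));   // prints "€" (the quotation marks are part of the output)
    return 0;
}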

Solution

It's simply a bug. §2.1/1 says about Phase 1,

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
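
For illustration, a hedged sketch of that raw string behaviour (assuming a C++11 compiler and a UTF-8 execution character set; the variable names are mine):

#include <cstdio>

int main() {
    // C++11 reverts any phase-1 UCN replacement inside a raw string literal,
    // so the euro sign stays exactly as written in the source file.
    const char *raw = R"(€)";    // always the source character, never \u20AC
    const char *ucn = "\u20AC";  // a UCN, converted in phase 5 to the execution character set
    std::printf("%s %s\n", raw, ucn);  // both print € under a UTF-8 execution character set
    return 0;
}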

This program clearly demonstrates the malfunction:

#include <iostream>

#define GET_UCN(X) L ## #X

int main() {
    std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}

http://ideone.com/lb9jc

Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
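
A sketch of what that corruption can look like (my illustration, not part of the original demonstration; what it prints depends entirely on how the compiler decodes the source file):

#include <cstdio>
#include <cwchar>

int main() {
    // If the compiler decodes the UTF-8 source, L"€" is the single wide
    // character U+20AC; if it blindly widens each byte, it becomes the three
    // wide characters 0xE2, 0x82, 0xAC.
    std::printf("%zu\n", std::wcslen(L"€"));   // 1 with UTF-8 support, 3 without
    return 0;
}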
