Spaces inserted by the C preprocessor

Problem description

Suppose we are given this input C code:

#define Y 20
#define A(x) (10+x+Y)

A(A(40))

gcc -E produces output like (10+(10+40 +20)+20)

gcc -E -traditional-cpp produces output like (10+(10+40+20)+20)

Why does the default cpp insert a space after 40?

Where can I find the most detailed specification of cpp that covers this logic?

Recommended answer

The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation process specified by the standard.

In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:


  • parse preprocessor directives

  • correctly process the stringification operator

The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.

Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.

In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.

For example, if the macro invocation in your example were modified to

A(A(0x40E))

then naive output of the token stream would result in

(10+(10+0x40E+20)+20)

which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
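To make the pp-number problem concrete, here is a minimal sketch of a scanner for the pp-number grammar (C11 §6.4.8). The function name pp_number_length is made up for illustration, and universal character names are not handled. It shows that 0x40E+20 is consumed as one pp-number, while 40+20 stops after 40.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest pp-number starting at s, or 0 if none starts there.
   A pp-number begins with a digit, or with '.' followed by a digit, and then
   continues with digits, letters, '_', '.', or a sign that directly follows
   e, E, p or P. (Hypothetical helper; universal character names ignored.) */
size_t pp_number_length(const char *s)
{
    assert(s != NULL);
    size_t i;

    if (isdigit((unsigned char)s[0]))
        i = 1;
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        i = 2;
    else
        return 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        if ((c == '+' || c == '-') && strchr("eEpP", s[i - 1]) != NULL)
            i++;                      /* sign glued onto an exponent letter */
        else if (isalnum(c) || c == '_' || c == '.')
            i++;
        else
            return i;                 /* also stops at the terminating NUL */
    }
}
```

Calling pp_number_length("0x40E+20") consumes the whole string, because the E makes the following + part of the number, whereas pp_number_length("0x40E +20") stops at the space.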

If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.

Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:

$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)         
Q(A(A(40)))'

"(10+(10+40+20)+20)"

The handling of whitespace in stringification is precisely specified by the standard: (§6.10.3.2, paragraph 2, many thanks to John Bollinger for finding the specification.)


Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
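The quoted rule can be checked directly from C. This is a small sketch, with STR and stringify_collapses_whitespace as made-up names: runs of whitespace between the argument's tokens become one space, leading and trailing whitespace disappears, and no space appears where the argument had none.

```c
#include <assert.h>
#include <string.h>

#define STR(x) #x   /* hypothetical helper: stringify the argument as written */

/* Returns 1 if stringification treats whitespace as the quoted rule says. */
int stringify_collapses_whitespace(void)
{
    const char *spaced = STR(   a   +     b   );
    const char *tight  = STR(a+b);

    assert(spaced != NULL && tight != NULL);
    return strcmp(spaced, "a + b") == 0 && strcmp(tight, "a+b") == 0;
}
```

Because the argument is not macro-expanded before # is applied, STR shows the whitespace exactly as it was written at the call site; the two-level Q/Q_ pattern used elsewhere in this answer is what exposes the expanded token stream.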

Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to produce the real token stream). The I (identity) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).

Maybe this could be a useful test case for your preprocessor:

#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>

char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}

Here's the output of gcc -E (just the good stuff at the end):

$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}

In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the tokens. However, the program compiles and executes fine, even if passed explicitly through gcc -E:

$ gcc -E squish.c | gcc -x c - && ./a.out 
intmain(void){returnputs(quoted);}

Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.

The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
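As a rough sketch of that simplest strategy, assuming tokens are already available as an array of spellings (the function name serialize_spaced and the buffer-based interface are choices made for this example, not anything gcc does):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the token spellings into out, separated by single spaces.
   Returns the number of characters written (excluding the NUL), or
   (size_t)-1 if out is too small. Hypothetical interface for illustration. */
size_t serialize_spaced(const char *const *tokens, size_t count,
                        char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    size_t used = 0;

    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(tokens[i]);
        size_t sep = (i > 0) ? 1 : 0;   /* one space before every token but the first */

        if (used + sep + len + 1 > out_size)
            return (size_t)-1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, tokens[i], len);
        used += len;
        out[used] = '\0';
    }
    return used;
}
```

Fed the tokens (, 10, +, 0x40E, +, 20, ) it produces "( 10 + 0x40E + 20 )": more spaces than necessary, but 0x40E and + can no longer be glued into one pp-number.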

I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.
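The check described above might look roughly like the following. This sketch does not store a DFA state in the token; it re-derives the equivalent information from the token's spelling, and the names needs_space, LONGER and is_literal_prefix are invented for the example. Header names and universal character names are not handled, and the identifier rule only covers the L, u, U and u8 literal prefixes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Multi-character spellings a shorter punctuator could grow into. The two
   comment openers are listed because gluing them would start a comment. */
static const char *const LONGER[] = {
    "...", "->", "++", "--", "<<", ">>", "<=", ">=", "==", "!=", "&&", "||",
    "*=", "/=", "%=", "+=", "-=", "<<=", ">>=", "&=", "^=", "|=", "##",
    "<:", ":>", "<%", "%>", "%:", "%:%:", "//", "/*"
};

static int is_ident_char(unsigned char c) { return isalnum(c) || c == '_'; }

static int is_literal_prefix(const char *tok)
{
    return strcmp(tok, "L") == 0 || strcmp(tok, "u") == 0 ||
           strcmp(tok, "U") == 0 || strcmp(tok, "u8") == 0;
}

/* Returns 1 if a scanner that has just read the complete token tok could
   keep going on next, so a space must be emitted between them. */
int needs_space(const char *tok, char next)
{
    assert(tok != NULL && tok[0] != '\0');
    size_t len = strlen(tok);
    unsigned char first = (unsigned char)tok[0];
    unsigned char c = (unsigned char)next;

    if (isdigit(first) || (first == '.' && isdigit((unsigned char)tok[1]))) {
        /* pp-number: a sign only continues it right after e, E, p or P */
        if ((c == '+' || c == '-') && strchr("eEpP", tok[len - 1]) != NULL)
            return 1;
        return is_ident_char(c) || c == '.';
    }
    if (is_ident_char(first)) {
        /* identifier: grows on identifier characters, or becomes a literal */
        if (is_ident_char(c))
            return 1;
        return (c == '"' || c == '\'') && is_literal_prefix(tok);
    }
    if (strcmp(tok, ".") == 0 && isdigit(c))
        return 1;                       /* ".5" would be a pp-number */
    for (size_t i = 0; i < sizeof LONGER / sizeof LONGER[0]; i++) {
        if (strlen(LONGER[i]) > len && strncmp(LONGER[i], tok, len) == 0 &&
            LONGER[i][len] == next)
            return 1;                   /* prefix of a longer punctuator */
    }
    return 0;
}
```

With this sketch, needs_space("40", '+') is 0 and needs_space("0x40E", '+') is 1, matching the behaviour described for the original test case, and needs_space("int", 'm') is 1, which is why int and main must be separated in the serialized output.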

If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.

If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
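A table of that shape could be sketched as follows. The token types, the table name may_glue and both functions are assumptions made for this example; the table is filled from hand-written rules rather than derived from a real DFA, only + and - get their own punctuator types, and everything else falls into a deliberately over-cautious TOK_OTHER row.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical token classification; a real lexer would have one type per
   punctuator instead of lumping most of them into TOK_OTHER. */
enum tok_type { TOK_IDENT, TOK_PP_NUMBER, TOK_PLUS, TOK_MINUS, TOK_OTHER,
                TOK_TYPE_COUNT };

/* may_glue[type][c] is true if SOME token of that type can be extended by c. */
static bool may_glue[TOK_TYPE_COUNT][256];

void build_glue_table(void)
{
    memset(may_glue, 0, sizeof may_glue);
    for (int c = 0; c < 256; c++) {
        bool ident_char = isalnum(c) || c == '_';

        may_glue[TOK_IDENT][c] = ident_char || c == '"' || c == '\'';
        /* Some pp-number (one ending in e, E, p or P) can absorb a sign,
           so the whole type is marked, even for numbers like 40. */
        may_glue[TOK_PP_NUMBER][c] =
            ident_char || c == '.' || c == '+' || c == '-';
        /* Catch-all row: assume any punctuation or digit might glue. */
        may_glue[TOK_OTHER][c] = ispunct(c) || isdigit(c);
    }
    may_glue[TOK_PLUS]['+'] = may_glue[TOK_PLUS]['='] = true;
    may_glue[TOK_MINUS]['-'] = may_glue[TOK_MINUS]['='] = true;
    may_glue[TOK_MINUS]['>'] = true;
}

/* Call build_glue_table() once before using this lookup. */
bool may_need_space(enum tok_type type, char next)
{
    assert(type < TOK_TYPE_COUNT);
    return may_glue[type][(unsigned char)next];
}
```

After build_glue_table(), may_need_space(TOK_PP_NUMBER, '+') is true regardless of the number's spelling, which is exactly how a type-keyed lookup ends up putting a space after 40.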
