Responsibilities of the Lexer and the Parser

Question

I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant.

For the input foo = 42, three tokens are recognized:


  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)

So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:


  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)
  4. bar (identifier)

Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit. It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: It finds the identifier bar.
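The behaviour described above can be reproduced with a minimal sketch. The question does not show its own lexer, so the token names, the `TOKEN_SPEC` table, and the `tokenize` function below are assumptions made purely for illustration.

```python
import re

# Hypothetical token table; the names are illustrative, not from the question.
TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SYMBOL", r"="),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source into (kind, text) pairs, silently dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("foo = 42"))
print(tokenize("foo = 42bar"))  # four tokens: the integer 42 is followed directly by bar
```

Because the integer rule stops at the first non-digit and nothing checks what follows, `42bar` splits into an integer and an identifier without any complaint.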

Now, here's my question: Is it still the lexer's responsibility to recognize that an identifier is not allowed at that position? Or does that check belong to the responsibilities of the parser?

Recommended answer

I don't think there is any consensus on the question of whether 42foo should be recognised as an invalid number or as two tokens. It's a question of style and both usages are common in well-known languages.

For example:

$ python -c 'print 42and False'
False

$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'

$ perl -le 'print 42and 0'
42

# Not an idiosyncrasy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number

# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant

$ ruby -le 'print 42and 1'
42

# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423

So, both possibilities are in common use.

If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and, the fragments 42 + 1, 42+1, and 42+ 1 should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.
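If you take the first route, one way to do the rejection in the lexer is sketched below. This is only an illustration under assumptions: the token names, the `MALFORMED_NUMBER` rule, and the error wording (loosely modelled on Lua's message) are invented here and are not from any particular implementation.

```python
import re

# Hypothetical token table. MALFORMED_NUMBER is listed before INTEGER so that
# digits immediately followed by identifier characters are caught first.
TOKEN_SPEC = [
    ("MALFORMED_NUMBER", r"[0-9]+[A-Za-z_][A-Za-z0-9_]*"),
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SYMBOL", r"[=+]"),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (kind, text) pairs; whitespace never reaches the parser."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup
        if kind == "MALFORMED_NUMBER":
            raise SyntaxError(f"malformed number near {match.group()!r}")
        if kind != "WHITESPACE":
            tokens.append((kind, match.group()))
        pos = match.end()
    return tokens


# The parser sees the same token stream regardless of spacing.
print(tokenize("42 + 1") == tokenize("42+1") == tokenize("42+ 1"))

try:
    tokenize("foo = 42bar")
except SyntaxError as error:
    print(error)
```

The design point is that the parser only ever receives tokens, so any rule that depends on the presence or absence of whitespace has to live in the lexer.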

As a side note, in C and C++, 42and is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:

$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)"  -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120

Both 12E and 1F would be invalid integers, but pasted together with the ## operator, they form a perfectly legitimate float. The ## operator only works on single tokens, so 12E and 1F both need to be lexed as single tokens. c(12E+,1F) wouldn't work, but c(12E0,1F) is also fine.

This is also why you should always put spaces around the + operator in C: classic trick C question: "What is the value of 0x1E+2?"
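The trick works because a preprocessor number may contain a sign directly after e, E, p, or P, so 0x1E+2 is lexed as one token, which then fails to convert to a valid constant instead of evaluating to 32. The regular expression below is a rough approximation of the C pp-number grammar, written only for illustration; it deliberately ignores universal character names and C23 digit separators, and `first_pp_number` is a hypothetical helper.

```python
import re

# Approximate pp-number: an optional dot, a digit, then any run of digits,
# letters, underscores, dots, or an exponent letter followed by a sign.
PP_NUMBER = re.compile(r"\.?[0-9](?:[eEpP][+-]|[0-9A-Za-z_.])*")


def first_pp_number(text):
    """Return the pp-number at the start of text, or None if there is none."""
    match = PP_NUMBER.match(text)
    return match.group() if match else None


for sample in ("42and", "12E", "1F", "0x1E+2", "0x1E + 2"):
    print(repr(sample), "->", repr(first_pp_number(sample)))
```

With spaces around the operator, the token ends at 0x1E and the + is lexed separately, which is the behaviour most people expect.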

Finally, the explanation for the awk line:

$ awk 'BEGIN{print 42foo + 3}'
423

That's lexed by awk as BEGIN{print 42 foo + 3} which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0 if used arithmetically and "" if used as strings.)
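As a rough illustration of that evaluation order, here is the same computation spelled out in Python. The helper `awk_number_to_string` is hypothetical and only mimics how awk prints integral values; it is not a model of awk's full number-to-string conversion.

```python
def awk_number_to_string(value):
    # Integral numeric values are printed without a decimal point.
    return str(int(value)) if value == int(value) else str(value)


foo = 0  # an unset awk variable behaves as 0 in a numeric context

# (42)(foo + 3): the addition is evaluated first, then the two values are concatenated.
result = awk_number_to_string(42) + awk_number_to_string(foo + 3)
print(result)
```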
