在语句末尾解析可选的分号 [英] Parsing optional semicolon at statement end

查看:128
本文介绍了在语句末尾解析可选的分号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在编写一个解析器来解析类似C的语法.

I was writing a parser to parse C-like grammars.

首先,它现在可以解析如下代码:

First, it could now parse code like:

a = 1;
b = 2;

现在我想使行尾的分号为可选.

Now I want to make the semicolon at the end of line optional.

最初的YACC规则是:

The original YACC rule was:

stmt: expr ';' { ... }

新行由我自己编写的词法分析器处理(代码简化):

Where the new line is processed by the lexer that written by myself(the code are simplified):

rule(/\r\n|\r|\n/)          { increase_lineno(); return :PASS }

此处的:PASS指令等效于在LEX中不返回任何内容,该操作将删除当前匹配的文本并跳至下一条规则,就像通常使用空格一样.

the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.

因此,我不能简单地将YACC规则更改为:

Because of this, I can't just simply change my YACC rule into:

stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
    | '\n'
;

因此,我选择了由解析器动态地更改词法分析器的状态.

So I chose to change the lexer's state dynamically by the parser correspondingly.

赞:

stmt: expr { state = :STATEMENT_END } ';' { ... }

并添加可以使新行与新状态匹配的词法分析器规则:

And add a lexer rule that can match new line with the new state:

rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }

这表示词法分析器何时处于:STATEMENT_END状态.它将首先像往常一样增加行号,然后将状态设置为初始状态,然后假装自身为分号.

Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.

奇怪的是,它实际上不适用于以下代码:

It's strange that it doesn't actually work with following code:

a = 1
b = 2

我调试了它,但实际上并没有得到';'.如预期的那样,在扫描数字1之后的换行符时,并没有真正执行状态指定的规则.

I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.

设置新状态的代码在已经扫描新行并且什么也不返回之后执行,这意味着这些工作是按以下顺序完成的:

And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:

  1. 扫描a=1
  2. 扫描换行符并跳过,因此获取下一个值b
  3. 执行插入的代码({ state = :STATEMENT_END })
  4. 引发错误-意外的b在这里
  1. scan a, = and 1
  2. scan newline and skip, so get the next value b
  3. the inserted code({ state = :STATEMENT_END }) is executed
  4. raising error -- unexpected b here

这是我的期望:

  1. 扫描a=1
  2. 发现它符合规则expr,因此简化为stmt
  3. 执行插入的代码以设置新的词法分析器状态
  4. 扫描换行符,并根据新的状态匹配规则返回;
  5. 继续扫描&解析以下行
  1. scan a, = and 1
  2. found that it matches the rule expr, so reduce into stmt
  3. execute the inserted code to set the new lexer state
  4. scan the newline and return a ; according the new state matching rule
  5. continue to scan & parse the following line

自省之后,我发现可能是由于YACC使用LALR(1)引起的,该解析器将首先读取一个令牌.扫描到该位置时,状态尚未设置,因此无法获取正确的令牌.

After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.

我的问题是:如何使其按预期工作?我对此一无所知.

My question is: how to make it work as expected? I have no idea on this.

谢谢.

推荐答案

首先要认识到的是,拥有像这样的可选行终止符会在您的语言中引入歧义,因此您首先需要确定要使用哪种方式来解决模棱两可.在这种情况下,主要的歧义来自可能是中缀或前缀的运算符.例如:

The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:

a = b
-c;

您想将以上内容视为一个单独的expr语句,还是将两个第一个分号删除的独立语句?类似的潜在歧义会在类似C语言的函数调用语法中发生:

Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:

a = b
(c);

如果要将它们分解为两个语句,则可以使用您尝试过的方法.您只需要更早地将状态设置为一个令牌即可.这变得棘手,因为如果您未设置括号,则不想设置状态,因此最终需要一个额外的状态var来记录括号嵌套深度,并且仅在需要时设置insert-semi-before-newline状态. 0.

If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.

如果您想将上述情况作为一个语句解决,那么事情就会变得棘手,因为实际上您需要更多前瞻性来决定换行符应何时结束一条语句-至少您需要在换行符之后查看令牌(以及任何评论或其他被忽略的内容).在这种情况下,您可以让词法分析器进行额外的前瞻.如果您使用的是flex(显然不是吗?),我建议您使用/运算符(直接进行超前查找),或推迟返回分号,直到与下一个标记匹配的词法分析器规则为止.

If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.

通常,在进行这种令牌状态记录时,我发现在可能的情况下完全在词法分析器中最简单地完成此操作,因此您不必担心有时(但并非总是)完成额外的超前标记由解析器.在这种特定情况下,一种简单的方法是让词法分析器记录所看到的括号((为+1,)为-1),并返回最后一个标记.然后,在换行规则中,如果paren级别为0并且最后一个标记是可以结束表达式的值(ID或常数或)或仅用于后缀运算符),则返回多余的;

In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;

另一种方法是让词法分析器返回NEWLINE作为其自己的令牌.然后,您将更改解析器以接受stmt: expr NEWLINE以及语法中大多数其他标记之间的可选换行符.这将歧义直接暴露给解析器(现在不是LALR(1)),因此您需要通过使用yacc的运算符优先级规则(容易出错且容易出错)或使用诸如bison的%glr-parser选项或btyacc的回溯之类的方法来解决它直接处理歧义的能力.

An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.

这篇关于在语句末尾解析可选的分号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆