当一个规则覆盖另一个规则的子集时,消除语法歧义 [英] Eliminating grammar ambiguity when a rule covers a subset of another

查看:83
本文介绍了当一个规则覆盖另一个规则的子集时,消除语法歧义的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试建立一个小的野牛语法,但是部分定义存在问题.可以使用右侧合法的任何表达式(语法中的expression_list)作为参数来调用函数.

I am trying to build a small bison grammar, but am having an issue with part of the definition. Functions can be called with any expression legal on the right side (expression_list in the grammar) as arguments.

之所以出现此问题,是因为可以在左侧通过分配功能来定义功能(一个标识符,后跟一个标识符列表-语法中的assignment_expression和identifier_list)

The issue arises because on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar)

我的问题是,如何消除语法中的歧义,因为左侧的合法语句是右侧合法语句的子集.

My question is how I can eliminate the ambiguity in my grammar, since the statements legal on the left side are a subset of those legal on the right.

语法是用野牛写成的(v.2.4.1)

The grammar is written in bison (v. 2.4.1)

命令的输出为:

2 shift/reduce, 2 reduce/reduce
warning: rule useless in parser due to conflicts: assignment_expression: IDENTIFIER LPAREN RPAREN

这是完整的语法:

expression:
    assignment_expression
    | expression DECORATOR IDENTIFIER

value:
    IDENTIFIER
    | HEX 
    | BIN 
    | OCT
    | SCI 
    | FLOAT 
    | INT
    ;

constant_expression:
    value
    | LPAREN constant_expression RPAREN
    | constant_expression OR constant_expression
    | constant_expression XOR constant_expression
    | constant_expression AND constant_expression
    | constant_expression LSHIFT constant_expression
    | constant_expression RSHIFT constant_expression
    | constant_expression PLUS constant_expression
    | constant_expression MINUS constant_expression
    | constant_expression MUL constant_expression
    | constant_expression DIV constant_expression
    | constant_expression MOD constant_expression
    | constant_expression POW constant_expression
    | constant_expression FACTORIAL
    | NOT constant_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN constant_expression RPAREN
    | IDENTIFIER LPAREN expression_list RPAREN
    ;

expression_list:
    constant_expression COMMA constant_expression
    | expression_list COMMA constant_expression
    ;

assignment_expression:
    constant_expression
    | IDENTIFIER EQUAL assignment_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN IDENTIFIER RPAREN
    | IDENTIFIER LPAREN identifier_list RPAREN
    ;

identifier_list:
    IDENTIFIER COMMA IDENTIFIER
    | identifier_list COMMA IDENTIFIER
    ;

以下是详细模式(-v)中的野牛输出的相关部分:

Here are the relevant sections of bison's output from verbose mode (-v):

State 34 conflicts: 2 shift/reduce
State 35 conflicts: 2 reduce/reduce

state 34

3 value: IDENTIFIER .
25 constant_expression: IDENTIFIER . LPAREN RPAREN
26                    | IDENTIFIER . LPAREN constant_expression RPAREN
27                    | IDENTIFIER . LPAREN expression_list RPAREN
33 assignment_expression: IDENTIFIER LPAREN IDENTIFIER . RPAREN
35 identifier_list: IDENTIFIER . COMMA IDENTIFIER

COMMA   shift, and go to state 53
LPAREN  shift, and go to state 39
RPAREN  shift, and go to state 54

COMMA     [reduce using rule 3 (value)]
RPAREN    [reduce using rule 3 (value)]
$default  reduce using rule 3 (value)


state 35

25 constant_expression: IDENTIFIER LPAREN RPAREN .
32 assignment_expression: IDENTIFIER LPAREN RPAREN .

$end       reduce using rule 25 (constant_expression)
$end       [reduce using rule 32 (assignment_expression)]
DECORATOR  reduce using rule 25 (constant_expression)
DECORATOR  [reduce using rule 32 (assignment_expression)]
$default   reduce using rule 25 (constant_expression)

根据请求,这里是一个最小的语法问题:

As per request here is a minimal grammar with the issue:

assignment_expression:
    constant_expression
    | IDENTIFIER LPAREN identifier_list RPAREN
    ;

value:
    IDENTIFIER
    | INT
    ;

constant_expression:
    value
    | IDENTIFIER LPAREN expression_list RPAREN
    ;

expression_list:
    constant_expression COMMA constant_expression
    | expression_list COMMA constant_expression
    ;

identifier_list:
    IDENTIFIER COMMA IDENTIFIER
    | identifier_list COMMA IDENTIFIER
    ;

推荐答案

您的文字和语法不太一致.也许我无法正确理解您的文字.你说:

Your text and your grammar don't quite line up. Or maybe I'm not understanding your text correctly. You say:

在左侧,可以通过分配功能来定义功能(一个标识符,后跟一个标识符列表-语法中的assignment_expression和identifier_list)

on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar)

在我的脑海中,我想像这样的例子:

In my head, I imagine an example of that to be something like:

comb(n, r) = n! / (r! * (n-r)!)

但是您的语法为:

assignment_expression:
    constant_expression
    | IDENTIFIER EQUAL assignment_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN IDENTIFIER RPAREN
    | IDENTIFIER LPAREN identifier_list RPAREN

不会解析上面的定义,因为唯一出现在 EQUAL 左侧的是 IDENTIFIER .右递归允许在Assignment_expression之前重复 IDENTIFIER = 的任意数量,但是最后一件事必须是 constant_expression 或三个原型产品之一.因此,这将是匹配的:

Which will not parse the above definition, because the only thing which can appear to the left hand side of EQUAL is IDENTIFIER. The right-recursion allows any number of repetitions of IDENTIFIER = before an assignment_expression, but the last thing must either be either a constant_expression or one of the three prototype productions. So this would be matched:

c = r = f(a,b)

但这也是:

c = r = f(2, 7)

我要说的是,这使您的语法本来就模棱两可,但这可能是一个错误.您可能的意思是:

I'd say that makes your grammar inherently ambiguous, but it is probably an error. What you probably meant was:

assignment_expression: rvalue
                     | lvalue '=' assignment_expression

rvalue: constant_expression

lvalue: IDENTIFIER
      | IDENTIFIER '(' ')'
      | IDENTIFIER '(' identifier_list ')'

我顺便指出,您对 identifier_list 的定义至少需要两个标识符是不必要的复杂,因此我在上面假设 identifier_list 的实际定义是:

I note in passing that your definition of identifier_list as requiring at least two identifiers is unnecessarily complicated, so I've assumed above that the actual definition of identifier_list is:

identifier_list: IDENTIFIER | identifier_list ',' IDENTIFIER

但是,这不足以解决问题.它仍然使解析器不知道是否:

That's not sufficient to solve the problem, though. It still leaves the parser not knowing whether:

comb(n      | lookahead ',' 

comb(n, r) = ...

或只是一个函数调用

comb(n, 4)

所以要解决这个问题,我们需要抽出一些重型火炮.

So to fix that, we need to pull out some heavy artillery.

我们可以从简单的解决方案开始.该语法不是模棱两可的,因为必须在 = 之后跟随 lvalue .当我们最终到达 = 时,即使它们看起来恰好相距很远,也可以告诉我们到目前为止是 rvalue 还是 lvalue 完全相同的.(例如, comb(n,r).)唯一的问题是, = 可能与我们碰巧的位置无限距离.

We can start with the simple solution. This grammar is not ambiguous, since an lvalue must be followed by =. When we finally reach the =, we can tell whether what we have so far is an rvalue or an lvalue, even if they happen to look identical. (comb(n, r), for example.) The only issue is that the = may be an unlimited distance from where we happen to be.

对于野牛,如果我们的语法很明确,并且不麻烦解决前瞻性问题,则可以要求使用GLR解析器.GLR解析器的效率稍低,因为它需要并行维护所有可能的解析,但是对于大多数明确的语法来说,它仍然是线性复杂性.(GLR解析器甚至可以解析O(N 3 )中的模棱两可的语法,但是bison实现不能容忍歧义.毕竟,它旨在解析编程语言.)

With bison, if we have an unambiguous grammar and we cannot be bothered to fix the lookahead problem, we can ask for a GLR parser. The GLR parser is slightly less efficient because it needs to maintain all possible parses in parallel, but it is still linear complexity for most unambiguous grammars. (GLR parsers can parse even ambiguous grammars in O(N3) but the bison implementation doesn't tolerate ambiguity. It's designed to parse programming languages, after all.)

因此,您只需添加

%glr-parser

并阅读《野牛手册》中有关如何影响语义动作的部分.(摘要:它们一直存储到解析被消除为止,因此它们可能不会像在LALR(1)解析器中那样早在解析期间发生.)

and read the section of the bison manual about how semantic actions are affected. (Summary: they are stored up until the parse is disambiguated, so they may not happen as early during the parse as they would in an LALR(1) parser.)

在实践中相当普遍的第二种简单解决方案是让解析器接受所需语言的超集,然后在语义动作中添加可以说是句法检查的内容.因此,您可以编写语法以允许任何看起来像 call_expression 的内容都位于分配的左侧,但是当您实际为分配/定义构建AST节点时,请验证调用的参数列表实际上是一个简单的标识符列表.

A second simple solution, which is fairly common in practice, is to have the parser accept a superset of the desired language, and then add what is arguably a syntactic check in the semantic action. So you could just write the grammar to allow anything which looks like a call_expression to be on the left-hand-side of an assignment, but when you actually build the AST node for the assignment/definition, verify that the argument list for the call is actually a simple list of identifiers.

这不仅简化了语法,而且没有太多实现成本,而且还可以生成准确的错误消息来描述语法错误,这对于标准的LALR(1)解析器而言并不容易.

Not only does that simplify your grammar without much implementation cost, it makes it possible to generate accurate error messages to describe the syntax error, something which is not easy with a standard LALR(1) parser.

不过,对于您的语言(或更确切地说,对于您所想的语言而言),有一个 LALR(1)语法.为了产生它,我们需要避免强制减少,直到我们知道它是哪一个为止,该减少将区分 lvalue rvalue .

Still, there is an LALR(1) grammar for your language (or, rather, for what I imagine your language to be). In order to produce it, we need to avoid forcing a reduction which would distinguish between an lvalue and an rvalue until we know which one it is.

因此问题将是 IDENTIFIER 可能是expression_list的一部分,也可能是identifier_list的一部分.而且,即使看到),我们也不知道是哪一个.因此,我们需要特殊地使用 IDENTIFIER'('identifier_list')',以使其成为 lvalue rvalue 的一部分.换句话说,我们需要类似的东西:

So the issue will be that an IDENTIFIER could either be part of an expression_list or part of an identifier_list. And we don't know which one, even when we see the ). Consequently, we need to special case IDENTIFIER '(' identifier_list ')', to allow it to be part of both lvalue and rvalue. In other words, we need something like:

lvalue: IDENTIFIER | prototype
rvalue: expression_other_than_lvalue | lvalue

剩下的问题是我们如何定义 expression_other_than_lvalue .

Which leaves the question of how we define expression_other_than_lvalue.

在很多时候,解决方案很简单:常量,运算符表达式,带括号的表达式;这些都不是左值.带有括号列表且包含 expression_other_than_identifier 的呼叫也是 expression_other_than_identifier .唯一不算的是 IDENTIFIER(IDENTIFIER,IDENTIFIER,...)

Much of the time, the solution is simple: constants, operator expressions, parenthesized expressions; none of these can be lvalues. A call with a parenthesized list which includes an expression_other_than_identifier is also an expression_other_than_identifier. The only thing which won't count is precisely IDENTIFIER(IDENTIFIER,IDENTIFIER,...)

因此,让我们尽可能地重写语法.(我已将 constant_expression 更改为 lvalue ,因为它的键入时间较短.并用许多标记名称代替了实际的符号,我觉得这样更具可读性.但是下面的大部分内容与您的原件相同.)

So let's rewrite the grammar as far as we can. (I've changed constant_expression to lvalue because it was shorter to type. And substituted many token names for the actual symbol, which I find more readable. But much of the following is the same as your original.)

value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT

expr_not_lvalue:
    value_not_identifier
    | '(' rvalue ')'
    | rvalue OR rvalue
    | ...
    | IDENTIFIER '(' list_not_id_list ')'

lvalue:
    IDENTIFIER
    | IDENTIFIER '(' ')'
    | IDENTIFIER '(' identifier_list ')'

identifier_list:
    IDENTIFIER | identifier_list ',' IDENTIFIER

现在,除了我们尚未定义 list_not_id_list 的细节之外,所有内容都将落实到位. lvalue expr_not_lvalue 是不相交的,因此我们可以得出以下结论:

Now, aside from the detail that we haven't defined list_not_id_list, everything will fall into place. lvalue and expr_not_lvalue are disjoint, so we can finish up with:

rvalue:
    lvalue
    | expr_not_lvalue

assignment_expression:
    rvalue
    | lvalue '=' assignment_expression

我们只需要处理不是标识符列表的表达式列表.如上所述,这类似于:

And we only need to deal with expression lists which are not identifier lists. As noted above, that is something like:

expr_not_identifier:
    expr_not_lvalue
    lvalue_not_identifier

list_not_id_list:
    expr_not_identifier
    | list_not_id_list ',' rvalue
    | identifier_list ',' expr_not_identifier

因此,在解析列表时,第一次发现不是标识符的内容时,我们从 identifier_list 生产中删除了该列表.如果我们遍历整个列表,则当需要 rvalue 时,我们可能仍会使用 lvalue 找到自己,但是当我们看到 = 或语句终止符.

So while parsing a list, the first time we find something which is not an identifier, we remove the list from the identifier_list production. If we get through the whole list, then we might still find ourselves with an lvalue when an rvalue is desired, but that decision (finally) can be made when we see the = or statement terminator.

所以正确的(我希望)完整的语法是:

So the correct (I hope) complete grammar is:

expression:
    assignment_expression
    | expression DECORATOR IDENTIFIER

assignment_expression:
    rvalue
    | lvalue '=' assignment_expression

value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT

expr_not_lvalue:
    value_not_identifier
    | '(' rvalue ')'
    | rvalue OR rvalue
    | rvalue XOR rvalue
    | rvalue AND rvalue
    | rvalue LSHIFT rvalue
    | rvalue RSHIFT rvalue
    | rvalue '+' rvalue
    | rvalue '-' rvalue
    | rvalue '*' rvalue
    | rvalue '/' rvalue
    | rvalue '%' rvalue
    | rvalue POW rvalue
    | rvalue '!'
    | NOT rvalue
    | IDENTIFIER '(' list_not_id_list')'

lvalue_not_identifier:
    IDENTIFIER '(' ')'
    | IDENTIFIER '(' identifier_list ')'

lvalue:
    lvalue_not_identifier
    | IDENTIFIER

rvalue:
    lvalue
    | expr_not_lvalue

identifier_list:
    IDENTIFIER | identifier_list ',' IDENTIFIER

list_not_id_list:
    expr_not_identifier
    | list_not_id_list ',' rvalue
    | identifier_list ',' expr_not_identifier

expr_not_identifier:
    expr_not_lvalue
    lvalue_not_identifier

鉴于简单解决方案的可用性以及实现精确语法所需的转换的简洁性,难怪您很少会看到这种构造.但是,您会发现它在ECMA-262标准(定义ECMAScript aka Javascript)中得到了广泛使用.该报告中使用的语法形式主义包括一种宏功能,可以简化上述转换,但是并不能使语法更易于阅读(恕我直言),而且我不知道实现该功能的解析器生成器.

Given the availability of simple solutions, and the inelegance of the transformations required to implement the precise grammar, it is little wonder that you rarely see this sort of construction. However, you will find it used extensively in the ECMA-262 standard (which defines ECMAScript aka Javascript). The grammar formalism used in that report includes a kind of macro feature which simplifies the above transformation, but it doesn't make the grammar any easier to read (imho), and I don't know of a parser generator which implements that feature.

这篇关于当一个规则覆盖另一个规则的子集时,消除语法歧义的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆