野牛减少/减少情况 [英] Bison reduce/reduce situation

查看:79
本文介绍了野牛减少/减少情况的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个特殊的构造,调试时遇到麻烦.我希望这种积极的简化能够正确地说明问题:

 %token VAR IDENTIFIER NUMBER INT%%根:code_statements;code_statements:code_statement|code_statements code_statement;code_statement:var_statement|expression_statement;var_decl:IDENTIFIER':'type_expression;var_statement:VAR var_decl';';expression_statement:value_expression';';function_call:'('')'|'('value_expression_list')';value_expression_list:value_expression|value_expression_list','value_expression;/****未知的表达(值/类型?)****/unknown_expression:识别码|unknown_expression'.识别码|postfix_known_value_expression'.识别码|postfix_known_type_expression'.识别码;/****值表达****/primary_value_expression:数字|'('value_expression')';postfix_known_value_expression:primary_value_expression|postfix_value_expression函数调用;postfix_value_expression:postfix_known_value_expression|unknown_expression;value_expression:postfix_value_expression;/****类型表达****/primary_type_expression:INT|'('type_expression')';postfix_known_type_expression:primary_type_expression|postfix_type_expression'*';postfix_type_expression:postfix_known_type_expression|unknown_expression;type_expression:postfix_type_expression;%% 

错误输出:

 状态1114 unknown_expression:未知表达式.'.'识别码22 postfix_value_expression:unknown_expression.29个postfix_type_expression:unknown_expression.'.'转移到状态26';'减少使用规则22(postfix_value_expression)';'[减少使用规则29(postfix_type_expression)]')'使用规则22减少(postfix_value_expression)')'[使用规则29减少(postfix_type_expression)]'*'使用规则29(postfix_type_expression)进行减少$ default使用规则22减少(postfix_value_expression) 

这里有2个问题,一个与)相关的问题很容易理解,但是我一直在努力寻找一种解决问题的优雅方法.关于; 的问题,我不太清楚它是如何发生的(尽管我确实知道如何影响它),而且我不确定如何从错误中获取相关信息以进行调试实际的问题.

在我实际的语法中,还有更多,但我认为模式基本上与这两种模式相同.

解决方案

这里的基本问题是状态11中所示的一对生产:(注意:我使用了命令行选项-report= all 而不是 -v 以便查看超前计算.)

  22 postfix_value_expression:unknown_expression.[';','((',')',',']29个postfix_type_expression:unknown_expression.[';',')','*'] 

在这里您可以看到, postfix_value_expression postfix_type_expression 的计算后瞻都包括; )./p>

第一个是(有些)虚假的;这是LALR算法合并实际上具有不同先行状态的两个状态的极少数情况之一.(在 bison手册中对此问题进行了简要讨论.如果我们要求野牛使用Canonical LR算法(%define lr.type canonical-lr ),则此冲突消失,但第二个冲突仍然存在(在状态52):

  22 postfix_value_expression:unknown_expression.['(',')']29 postfix_type_expression:未知表达式.[')','*'] 

我们在这里的语法确实有歧义.(正如语法似乎表明的那样,我假设类型"可以具有值"成员(反之亦然).)

假设输入包含

  x.member 

解析器无需知道 x 是类型还是值. x 将简单地解析为 unknown_expression (生产13),索引选择也将是 unknown_expression (生产14).[注意:我在此答案的末尾从输出文件中复制了数字产生规则].

但是由于类型可以加括号(产生25),因此可以合理地编写以上代码:

 (x).member 

现在解析器需要应用生产15或16才能得出 unknown_expression ,这意味着它需要使用生产22来减少 x (继续生产15)或产品29(用于产品16).但是,它不知道 x 是类型还是值,并且它不知道(至少在不查阅符号表的情况下).尽管这两个可能的约简是(大概)相等的,但是它们是不同的产生式,所以语法是模棱两可的.因此,减少/减少冲突.

以这种方式看,冲突是我们可能称为过早归类的结果.由于语法实际上并不关心索引运算符.的左操作数是类型还是值,因此此时不需要解析左操作数.在后续步骤中,将通过语义分析来解决这一问题.在这种情况下,最简单的解决方案是不用费心在语法中区分类型和值,这对于这种简单的语言子集来说就很好了.一个非常简单的语法就足够了:

  expression:postfix_expressionpostfix_expression:期限|postfix_expression'.识别码|postfix_expression'('expression_list')'|postfix_expression'*'期限:IDENTIFIER|'(' 表达 ')' 

一旦对所有标识符进行了分类,便可以在后树遍历中进行任何必要的语义检查(以确保表达式是正确的种类).

询问是否有必要在语法上解析表达式的类别很有用.确实需要消除语法歧义吗?

在C语言(和朋友)中,除非对标识符进行分类,否则该语言实际上是模棱两可的.例如,(x)* y 可以是:

  • 已取消引用的指针 y 的强制转换,或
  • 标量 x y
  • 的乘积

在不知道 x 的类别的情况下,无法生成正确的解析树.(为使这一点更加清楚,请考虑 z/(x)* y ,其中两种选择之间的解析树形状完全不同.还有许多其他示例.)

为C生成准确的分析树的常用解决方案是诉诸词法反馈,这反过来要求在使用前至少在分类的程度上进行声明.(即使C ++坚持要在使用前声明类型别名,尽管它允许在声明之前使用类成员.)

一个合理的选择是使用类似GLR解析器的东西,它会生成两个(或全部)解析树.(这通常称为 parse forest .)一旦对标识符进行了分类,随后的语义分析遍历就可以对解析森林进行修剪,尽管对于森林中的所有解析树而言,分类仍然必须是明确的.[注1]

避免上述歧义主要是语言设计的问题;例如,如果C强制转换的语法使用方括号而不是括号,则可以消除此问题.(当然,这会耗尽C ++用于lambda的语法空间.)尽管如此,即使不知道每个标识符的类别,也很容易想象一种语言的语法是明确.但是,由于语法差异,需要考虑表达域.您的语言似乎将采用这种形式.(从某种意义上讲,这是自上而下的分类,而不是自下而上的分类:声明和表达式语句具有不同的解析上下文.)

例如,很容易想象您期望以与C系列类似的方式使用 * 作为前缀解引用运算符和中缀乘法运算符.与上面显示的postfix类型构造运算符结合使用,最后得到一个用于前缀,postfix和infix的运算符.如果所有用途都被乱扔到同一个包中,那肯定是模棱两可的:单个令牌最多可以表示两个操作数,即infix运算符,prefix运算符和postfix运算符.如果 * 可以是三种运算符类型中的任何一种,则 a * * b 可能表示(a *)* b 一个*(* b).

但是即使不知道标识符的类别,该表达式仍然可以明确地解析,因为我们知道前缀和后缀 * 运算符都不能应用于后缀 * 类型构造函数.因此,(a *)* b 是不可能的,唯一有效的解析是 a *(* b).

不幸的是,由于LR(1)解析器需要基于单个超前标记来决定是否减少,因此关于是否应将 a * 减少为 * 时是否需要输入> type ,这是不可能的,因为(a * *)是有效的类型表达式.更糟糕的是,表达式中的星号数量没有限制,实际上可能是 a *…* [b] .我们可以根据最后一个星号后的标记来消除歧义,但这需要任意先行.

鉴于上述情况,最好的解决方案可能是GLR解析器,它没有先行限制,并结合了足够的语法以至少划分可以使用诸如 * 之类的运算符的上下文.

我们不能仅仅像OP中所介绍的那样在语法上抛出GLR解析器,因为-如上所述-尽管歧义并不重要,但它绝对是模棱两可的.幸运的是,通过避免过早的分类,我们可以创建明确的语法.

如上所示,带括号的表达式的问题在于语法要求它们要么是类型表达式,要么是值表达式,即使在无关紧要的情况下也是如此.为避免这种情况,我们需要通过创建第三种带括号的表达式来允许将决策延迟到必要时才进行.结果就是下面的语法,实际上是LALR(1).

我分开了表达产生式,因为否则样板不胜枚举.每个产生式有效地指示参数的预期类别,运算符的语法优先级以及结果的类别(或未知的类别).每个产品的左侧是 known_ 类别 _ 优先级 ,声称语法需要该类别,或者 unknown_ 优先级 ,表示语法本身无法预测类别.(例如,(expr).b"可能是类型或值,因此选择运算符的结果是 unknown_postfix .)

在左侧,您将看到形式为 category _ precedence 的非终结符,它表示语义分析必须产生具有该类别的对象.这些非终端实际上​​只是为了方便.每个人都简单地定义为

  category_precedence:known_category_precedence |unknown_precedence 

,其中第二种选择表示​​需要进行语义检查.如果在解析过程中有可能,则将其插入第二种替代方案的操作中.或者,可以将语义检查节点插入AST.

我将这些便利产品包括在样板中,但是其中大部分被注释掉了,因为当语法中的任何地方都没有使用非终端时,野牛会抱怨.

 %token VAR"var"%token标识符类型编号%%程序:%empty|程序说明声明:var_declaration|expr_statementvar_declaration:"var" IDENTIFIER':'类型';'expr_statement:值';'/*表达表达式语法的产品*/参数         : '(' ')'|'('value_list')'value_list:值|value_list','值known_value_postfix:value_postfix参数{/*函数调用*/}unknown_postfix:any_postfix'.标识符{/*选择*/}known_type_postfix:type_postfix'*'{/*指针类型构造函数*/}/*主要术语*/unknown_primary:IDENTIFIER|'('不明')'known_value_primary:NUMBER|'('已知值')'known_type_primary:TYPE|'('known_type')'/*具有两个中缀优先级的样板优先语法*/unknown_postfix:unknown_primaryknown_value_postfix:known_value_primaryknown_type_postfix:known_type_primaryvalue_postfix:已知值_postfix |unknown_postfixtype_postfix:已知类型_后缀|unknown_postfixany_postfix:known_value_postfix |known_type_postfix |unknown_postfixunknown_prefix:unknown_postfixknown_value_prefix:known_value_postfixknown_type_prefix:known_type_postfix/* value_prefix:已知值_前缀|unknown_prefix *//* type_postfix:已知类型前缀|unknown_prefix */unknown_infix9:unknown_prefixknown_value_infix9:known_value_prefixknown_type_infix9:known_type_prefix/* value_infix9:known_value_infix9 |unknown_infix9 *//* type_infix9:已知类型_infix9 |unknown_infix9 */unknown_infix8:unknown_infix9known_value_infix8:known_value_infix9known_type_infix8:known_type_infix9/* value_infix9:known_value_infix8 |unknown_infix8 *//* type_infix9:已知类型_infix8 |unknown_infix8 *//*最后一个节主要是为了方便起见,但它也可以*避免对';'进行虚假的减少/减少冲突.*/未知:unknown_infix8known_value:known_value_infix8known_type:known_type_infix8值:known_value |未知类型:known_type |未知 

有了该框架,我们就可以开始添加其他类型的表达式产品.

首先,让我们面对 * 上的三方冲突.基于上述模式,基本表达式的产生非常简单:

  known_value_prefix:'*'value_prefix {/*取消引用*/}known_value_infix9:value_infix9'*'value_prefix {/*产生*/} 

(我们还需要取消注释 value_prefix value_infix9 生产.)

尽管该语言仍然是明确的,但是它不再是LALR(1)(甚至对于任何k都是LR(k)),如上面在此语法的讨论中所指出的那样.因此,野牛会抱怨说存在减少-减少冲突.我们无法解决该投诉,但是我们可以轻松生成一个有效的解析器;我们需要做的就是插入一个要求野牛生成GLR解析器的请求:

 %glr-parser 

这不能消除冲突警告[注2],但确实可以产生有效的解析器.(由于没有精确的算法,野牛无法验证语法是明确的.)由于我们无法证明语法是明确的,因此我们需要进行大量测试.如果GLR解析器遇到歧义,则会产生一条错误消息,因此当我们看不到任何错误时,我们可以放心:

  $ ./typevar3(* a).b;[EXP [ASVALUE [SELECT [DEREF [ASVALUE a]] b]]a * b;[EXP [PROD [ASVALUE a] [ASVALUE b]]]a ** b;[EXP [PROD [ASVALUE b] [DEREF [ASVALUE b]]]](a *** b).c;[EXP [ASVALUE [选择[PROD [ASVALUE a]] [DEREF [DEREF [ASVALUE b]]] c]]](a ***).c;[EXP [ASVALUE [SELECT [POINTER [POINTER [POINTER [ASTYPE a]]] c]]]] 

现在,假设我们要添加一个类似于C的数组类型构造函数.我们将允许以下形式:

  type [count]//例如.整数[3] 

请注意,该格式在语法上类似于数组索引操作(a [2]),我们还将其添加到语言中.

这与select运算符略有不同.对于选择运算符,语法允许从中选择值或类型,并且无法预测结果.对于数组构造/索引,结果的类别恰好是第一个参数的类别.

因为我们需要传递"第一个参数的类别,所以我们需要三个产生式,就像括号产生式一样:

  known_value_postfix:known_value_postfix'['value']'known_type_postfix:known_type_postfix'['值']'unknown_postfix:unknown_postfix'['value']' 

在第三种情况下,我们需要在AST中插入一个"construct_or_index"节点.可以通过稍后的单元生产将其解析为构造节点或索引节点,以将未知类别转换为特定类别,或者可以将其留给语义分析阶段.

将这三个作品相加不会产生任何问题.但是,现在我们想添加一种用于构造可变大小数组的语法,我们将选择语法 int [*] ,从而生成 * 的另一种不兼容用法勒克美.该语法具有明确的结果类别,因为它不是有效的索引表达式,因此我们可以编写产生式:

  known_type_postfix:type_postfix'[''*'']' 

这一次,我们选择在右侧使用便捷非终端.当然,这会产生大量新的语法冲突,包括移位-归约和归约-归约,但是语法会继续按需运行:

  var a:(int *)[*];[DECL a [ARRAY [POINTER [TYPE int]]]]var a:(int *)[2 * 2];[DECL a [ARRAY [POINTER [TYPE int]]] [PROD [VALUE 2] [VALUE 2]]]a2];[EXP [ASVALUE [INDEX_OR_ARRAY a [VALUE 2]]]](*a2];[EXP [INDEX [DEREF [ASVALUE a]] [VALUE 2]]] 

左为练习:使用infix + 运算符表示两个值的标量加法以及两种类型的并集.请注意,您将不得不处理两个操作数中只有一个具有已知类别的情况(因此,另一个操作数必须被强制转换为同一类别),以及两个操作数都不具有已知类别的情况./p>


注释

  1. 由bison实现的GLR解析器对于此目的而言并不理想,因为它希望生成单个解析树.可以使用%merge 声明(如手册中所述)使用由野牛生成的解析器手动创建一个解析森林,但这不能实现其他GLR所实现的节省空间的图形表示解析器可以产生.

  2. 野牛手册建议使用%expect-rr 指令执行此操作,但是恕我直言,只有在语法可用于生产时才应执行此操作.不幸的是,您只能抑制已知数量的冲突,而不能抑制特定的预期冲突;我想这是因为很难精确地描述单个预期的冲突,但是最终结果是抑制冲突使得在开发语法时容易遗漏问题.验证冲突是否是预期的冲突是一件令人烦恼的事情,但比缺少语法错误更令人讨厌.

  3. 带有编号产生式的原始语法:

      1个根:code_statements2个code_statement:code_statement3 |code_statements code_statement4个code_statement:var_statement5 |expression_statement6 var_decl:IDENTIFIER':'type_expression7 var_statement:VAR var_decl';'8 expression_statement:value_expression';'9 function_call:'('')'10 |'('value_expression_list')'11 value_expression_list:value_expression12 |value_expression_list','value_expression13 unknown_expression:IDENTIFIER14 |unknown_expression'.识别码15 |postfix_known_value_expression'.识别码16 |postfix_known_type_expression'.识别码17 primary_value_expression:NUMBER18 |'('value_expression')'19 postfix_known_value_expression:主值表达式20 |postfix_value_expression函数调用21 postfix_value_expression:postfix_known_value_expression22 |unknown_expression23 value_expression:postfix_value_expression24 primary_type_expression:整数25 |'('type_expression')'26个postfix_known_type_expression:primary_type_expression27 |postfix_type_expression'*'28个postfix_type_expression:postfix_known_type_expression29 |unknown_expression30个type_expression:postfix_type_expression 

I have a particular construction that I'm having trouble debugging. I hope this aggressive simplification demonstrates the issue properly:

%token VAR IDENTIFIER NUMBER INT

%%
root:
    code_statements
    ;

code_statements:
    code_statement
    | code_statements code_statement
    ;

code_statement:
    var_statement
    | expression_statement
    ;

var_decl:
    IDENTIFIER ':' type_expression
    ;
var_statement:
    VAR var_decl ';'
    ;

expression_statement:
    value_expression ';'
    ;

function_call:
    '(' ')'
    | '(' value_expression_list ')'
    ;

value_expression_list:
    value_expression
    | value_expression_list ',' value_expression
    ;

/**** UNKNOWN EXPRESSIONS (VALUE/TYPE?) ****/

unknown_expression:
    IDENTIFIER
    | unknown_expression '.' IDENTIFIER
    | postfix_known_value_expression '.' IDENTIFIER
    | postfix_known_type_expression '.' IDENTIFIER
    ;

/**** VALUE EXPRESSION ****/

primary_value_expression:
    NUMBER
    | '(' value_expression ')'
    ;

postfix_known_value_expression:
    primary_value_expression
    | postfix_value_expression function_call
    ;

postfix_value_expression:
    postfix_known_value_expression
    | unknown_expression
    ;

value_expression:
    postfix_value_expression
    ;


/**** TYPE EXPRESSION ****/

primary_type_expression:
    INT
    | '(' type_expression ')'
    ;

postfix_known_type_expression:
    primary_type_expression
    | postfix_type_expression '*'
    ;

postfix_type_expression:
    postfix_known_type_expression
    | unknown_expression
    ;

type_expression:
    postfix_type_expression
    ;

%%

Error output:

State 11

   14 unknown_expression: unknown_expression . '.' IDENTIFIER
   22 postfix_value_expression: unknown_expression .
   29 postfix_type_expression: unknown_expression .

    '.'  shift, and go to state 26

    ';'       reduce using rule 22 (postfix_value_expression)
    ';'       [reduce using rule 29 (postfix_type_expression)]
    ')'       reduce using rule 22 (postfix_value_expression)
    ')'       [reduce using rule 29 (postfix_type_expression)]
    '*'       reduce using rule 29 (postfix_type_expression)
    $default  reduce using rule 22 (postfix_value_expression)

There are 2 problems here, the one related to the ) is easy to understand, but I'm struggling to think of an elegant way to solve it. The issue on the ;, I can't quite see how it happens (although I do know how to affect it), and I'm not sure how to tease the relevant information from the error to debug the actual problem.

In my actual grammar, there are a few more, but I think the patterns are basically the same as these 2.

解决方案

The basic issue here is the pair of productions shown in State 11: (Note: I used the command-line option --report=all instead of -v in order to see the lookahead computation.)

22 postfix_value_expression: unknown_expression .  [';', '(', ')', ',']
29 postfix_type_expression: unknown_expression .  [';', ')', '*']

Here you can see that the computed lookahead for both postfix_value_expression or postfix_type_expression includes ; and ).

The first of these is (somewhat) spurious; it is one of the rare cases in which the LALR algorithm conflates two states which actually have different lookaheads. (This issue is briefly discussed in the bison manual and at more length in most standard parsing textbooks.) If we ask bison to use the Canonical LR algorithm (%define lr.type canonical-lr), this conflict disappears, but the second one remains (in state 52):

22 postfix_value_expression: unknown_expression .  ['(', ')']
29 postfix_type_expression: unknown_expression .  [')', '*']

What we have here is a true ambiguity in your grammar. (I'm assuming that "types" can have "value" members (and vice-versa), as the grammar seems to indicate.)

Suppose the input includes

x.member

The parser has no need to know whether x is a type or a value. x will simply be parsed as an unknown_expression (production 13) and the index selection will also be an unknown_expression (production 14). [Note: I copied the number production rules from the output file at the end of this answer].

But since types can be parenthesized (production 25), the above could legitimately be written:

(x).member

Now the parser needs to apply either production 15 or 16 in order to derive unknown_expression, which means that it needs to reduce x using production 22 (to proceed with production 15) or production 29 (for production 16). It does not, however, known whether x is a type or a value, and it cannot know (at least, without consulting the symbol table). Although the two possible reductions are (presumably) equivalent, they are different productions, so the grammar is ambiguous. Hence the reduce/reduce conflict.

Looked at this way, the conflict is the result of what we might call premature categorization. Since the grammar doesn't actually care whether the left-hand operand of the index operator . is a type or a value, the left-hand operand does not need to be resolved at this point. That will be worked out by semantic analysis in some subsequent step. That being the case, the simplest solution is to not bother with distinguishing between types and values in the grammar, which will work out just fine for this simple language subset. A very simple grammar would suffice:

expression        : postfix_expression
postfix_expression: term
                  | postfix_expression '.' IDENTIFIER
                  | postfix_expression '(' expression_list ')'
                  | postfix_expression '*'
term              : IDENTIFIER
                  | '(' expression ')'

and any necessary semantic checks (to ensure that the expression is the right kind) could be done in a posterior tree walk, once all the identifiers have been categorized.

It is useful to ask whether it is ever necessary to grammatically resolve the category of an expression. Is it actually required to disambiguate the parse?

In C (and friends), the language really is ambiguous unless identifiers are categorized. For example, (x)*y is either:

  • a cast of the dereferenced pointer y, or
  • the product of the scalars x and y

A correct parse tree cannot be produced without knowing the category of x. (To make this even clearer, consider z/(x)*y, where the shape of the parse tree is radically different between the two alternatives. There are many other examples.)

The usual solution for producing an accurate parse tree for C, is to resort to lexical feedback, which in turn requires declaration before use, at least to the extent of categorization. (Even C++ insists that type aliases be declared before use, even though it allows class members to be used before declaration.)

A reasonable alternative is to use something like a GLR parser, which produces both (or all) parse tree. (This is usually called a parse forest.) A subsequent semantic analysis pass could prune the parse forest once identifiers have been categorized, although it is still necessary that the categorization is unambiguous for all parse trees in the forest. [Note 1]

Avoiding the above ambiguity is mostly a question of language design; had the syntax for C casts used square brackets instead of parentheses, for example, the problem could have been eliminated. (Of course, that would have used up the syntactic space which C++ used for lambdas.) Nonetheless, it is easy to imagine a language whose grammar is unambiguous even without knowing the categories of each identifier, but which nonetheless requires consideration of expression domains because of syntactic divergence. It seems likely that your language will be of this form. (In some sense, this is top-down categorization rather than bottom-up categorization: a declaration and an expression statement have different parsing contexts.)

For example, it is easy to imagine that you anticipate using * in a similar manner to the C family, as both a prefix dereference operator and infix multiplication operator. In combination with the postfix type construction operator shown above, that ends up with a single operator being used for prefix, postfix and infix. If all the uses are tossed indiscriminately into the same bag, that is necessarily ambiguous: a single token can represent at most two of operand, infix operator, prefix operator and postfix operator. If * could be any of the three operator types, then a * * b could mean either (a*) * b or a * (*b).

But that expression is still possible to parse unambiguously even without knowing the category of identifiers, because we know that neither the prefix nor the infix * operators can be applied to the result of the postfix * type constructor. Consequently, (a*) * b is impossible, and the only valid parse is a * (*b).

Unfortunately, since an LR(1) parser needs to decide whether or not to reduce on the basis of a single lookahead token, the decision as to whether a* should be reduced to a type or not needs to be made when the second * is encountered, which is not possible because (a * *) is a valid type expression. And worse, there is no limit to the number of asterisks in the expression, which might actually be a*…*[b]. We can disambiguate based on the token following the last asterisk, but that requires arbitrary lookahead.

Given the above, the best solution is probably a GLR parser, which doesn't have a lookahead limitation, combined with enough syntax to at least divide the contexts in which operators like * could be used.

We can't just throw a GLR parser at the grammar as presented in the OP, because -- as mentioned above -- it is definitely ambiguous even though the ambiguity is unimportant. Fortunately, we can create an unambiguous grammar by avoiding premature categorization.

The problem with parenthesized expressions, as shown above, is that the grammar requires them to be either type or value expressions, even in cases where it doesn't care which. To avoid that, we need to allow the possibility of delaying the decision until and if it is necessary by creating a third type of parenthesized expression. That results in the following grammar, which is actually LALR(1).

I separated out the expression productions because otherwise the boilerplate is overwhelming. Each production effectively indicates the expected category of the arguments, the grammatical precedence of the operator, and the category of the result (or unknown). The left-hand side of each production is known_category_precedence, asserting that the grammar requires that category, or unknown_precedence, indicating that the grammar by itself does not predict a category. (For example, '(expr).b' might be a type or a value, so the select operator's result is unknown_postfix.)

On the left-hand side, you will see non-terminals of the form category_precedence, which asserts that the semantic analysis must result in an object with that category. These non-terminals are really just for convenience; each one is defined simply as

category_precedence: known_category_precedence | unknown_precedence

in which the second alternative indicates the need for a semantic check. If that were possible during the parse, it would be inserted in the action for the second alternative. Alternatively, a semantic check node could be inserted into the AST.

I included these convenience productions in the boiler-plate but most of them are commented out because bison complains when a non-terminal is not used anywhere in the grammar.

%token VAR "var"
%token IDENTIFIER TYPE NUMBER
%%
program            : %empty
                   | program statement
statement          : var_declaration
                   | expr_statement
var_declaration    : "var" IDENTIFIER ':' type ';'
expr_statement     : value ';'
 /* Productions for expression syntaxes */
parameters         : '(' ')'
                   | '(' value_list ')'
value_list         : value
                   | value_list ',' value
known_value_postfix: value_postfix parameters    { /* function call */ }
unknown_postfix    : any_postfix '.' IDENTIFIER  { /* select */ }
known_type_postfix : type_postfix '*'            { /* pointer type constructor */ }

 /* Primary terms */
unknown_primary    : IDENTIFIER
                   | '(' unknown ')'
known_value_primary: NUMBER
                   | '(' known_value ')'
known_type_primary : TYPE
                   | '(' known_type ')'
 /* Boilerplate precedence grammar with two infix precedence levels */
unknown_postfix    : unknown_primary
known_value_postfix: known_value_primary
known_type_postfix : known_type_primary
value_postfix      : known_value_postfix | unknown_postfix
type_postfix       : known_type_postfix  | unknown_postfix
any_postfix        : known_value_postfix | known_type_postfix | unknown_postfix

unknown_prefix     : unknown_postfix
known_value_prefix : known_value_postfix
known_type_prefix  : known_type_postfix
/* value_prefix    : known_value_prefix | unknown_prefix */
/* type_postfix    : known_type_prefix  | unknown_prefix */

unknown_infix9     : unknown_prefix
known_value_infix9 : known_value_prefix
known_type_infix9  : known_type_prefix
/* value_infix9    : known_value_infix9 | unknown_infix9 */
/* type_infix9     : known_type_infix9  | unknown_infix9 */

unknown_infix8     : unknown_infix9
known_value_infix8 : known_value_infix9
known_type_infix8  : known_type_infix9
/* value_infix9    : known_value_infix8 | unknown_infix8 */
/* type_infix9     : known_type_infix8  | unknown_infix8 */

/* The last stanza is mostly for convenience but it also serves
 * to avoid the spurious reduce/reduce conflict on ';'.
 */
unknown            : unknown_infix8
known_value        : known_value_infix8
known_type         : known_type_infix8
value              : known_value | unknown
type               : known_type | unknown

With that framework in place, we can start adding other types of expression productions.

First, let's confront the three-way conflict on *. The basic expression productions are quite simple, based on the pattern described above:

known_value_prefix : '*' value_prefix            { /* dereference */ }
known_value_infix9 : value_infix9 '*' value_prefix { /* produce */ }

(We also need to uncomment the value_prefix and value_infix9 productions.)

Although the language is still unambiguous, it is no longer LALR(1) (or even LR(k) for any k), as indicated above in the discussion of this syntax. So bison will complain that there is a reduce-reduce conflict. We can't fix that complaint, but we can easily produce a working parser; all we need to do is insert a request that bison generate a GLR parser:

%glr-parser

That doesn't suppress the conflict warning [Note 2], but it does produce a working parser. (bison cannot verify that the grammar is unambiguous because there is no precise algorithm for doing so.) Since we cannot prove the grammar is unambiguous, we need to do extensive testing. If the GLR parser encounters an ambiguity, it will produce an error message, so we can be reassured when we don't see any error:

$ ./typevar3
(*a).b;
[EXP [ASVALUE [SELECT [DEREF [ASVALUE a]] b]]]
a*b;
[EXP [PROD [ASVALUE a] [ASVALUE b]]]
a**b;
[EXP [PROD [ASVALUE a] [DEREF [ASVALUE b]]]]
(a***b).c;
[EXP [ASVALUE [SELECT [PROD [ASVALUE a] [DEREF [DEREF [ASVALUE b]]]] c]]]
(a***).c;
[EXP [ASVALUE [SELECT [POINTER [POINTER [POINTER [ASTYPE a]]]] c]]]

Now, let's suppose we want to add an array type constructor, somewhat similar to C. We'll allow the form:

type[count]  // eg. int[3]

Note that the form is syntactically similar to an array index operation (a[2]), which we'll also add to the language.

This is a slightly different case from the select operator. In the case of the select operator, the grammar permits either values or types to be selected from, and cannot predict the result. In the case of array constructing/indexing, the category of the result is precisely the category of the first argument.

Because we need to "pass through" the category of the first argument, we need three productions, rather like the parenthesis productions:

known_value_postfix: known_value_postfix '[' value ']'
known_type_postfix : known_type_postfix '[' value ']'
unknown_postfix    : unknown_postfix '[' value ']'

In the third case, we'll need to insert an "construct_or_index" node into the AST. That could be resolved into a construct node or an index node by a later unit production which converts an unknown category into some specific category, or it could be left for the semantic analysis phase.

Adding these three productions produces no problems. However, now we would like to add a syntax for constructing variable-sized arrays, and we will choose the syntax int[*], generating yet another incompatible use of the * lexeme. That syntax has a definite result category, since it is not a valid index expression, so we can write the production:

known_type_postfix : type_postfix '[' '*' ']'

This time, we chose to use a convenience non-terminal on the right-hand side. Of course, that produces a raft of new grammatical conflicts, both shift-reduce and reduce-reduce, but the grammar continues to work as desired:

var a: (int*)[*];
[DECL a [ARRAY [POINTER [TYPE int]]]]
var a: (int*)[2*2];        
[DECL a [ARRAY [POINTER [TYPE int]] [PROD [VALUE 2] [VALUE 2]]]]
a[2];
[EXP [ASVALUE [INDEX_OR_ARRAY a [VALUE 2]]]]
(*a)[2]; 
[EXP [INDEX [DEREF [ASVALUE a]] [VALUE 2]]]

Left as an exercise: use the infix + operator to represent the scalar addition of two values, and the union of two types. Note that you will have to deal with the cases where just one of the two operands has a known category (and therefore the other operand must be coerced into the same category), as well as the case where neither operand has a known category.


Notes

  1. The GLR parser implemented by bison is not ideal for this purpose because it wants to produce a single parse tree. It is possible to manually create a parse forest with the bison-generated parser by using %merge declarations (as described in the manual), but this does not implement the space-efficient graph-representation which other GLR parsers can produce.

  2. The bison manual suggests using the %expect-rr directive to do that, but IMHO that should only be done once the grammar is production-ready. Unfortunately, you can only suppress a known count of conflicts, rather than suppressing particular expected conflicts; I suppose this is because it is difficult to precisely describe a single expected conflict, but the end result is that suppressing conflicts makes it easy to miss issues while you are developing the grammar. Verifying that the conflicts are the expected ones is annoying, but less annoying than missing an error in the grammar.

  3. The original grammar with numbered productions:

     1 root: code_statements
     2 code_statements: code_statement
     3                | code_statements code_statement
     4 code_statement: var_statement
     5               | expression_statement
     6 var_decl: IDENTIFIER ':' type_expression
     7 var_statement: VAR var_decl ';'
     8 expression_statement: value_expression ';'
     9 function_call: '(' ')'
    10              | '(' value_expression_list ')'
    11 value_expression_list: value_expression
    12                      | value_expression_list ',' value_expression
    13 unknown_expression: IDENTIFIER
    14                   | unknown_expression '.' IDENTIFIER
    15                   | postfix_known_value_expression '.' IDENTIFIER
    16                   | postfix_known_type_expression '.' IDENTIFIER
    17 primary_value_expression: NUMBER
    18                         | '(' value_expression ')'
    19 postfix_known_value_expression: primary_value_expression
    20                               | postfix_value_expression function_call
    21 postfix_value_expression: postfix_known_value_expression
    22                         | unknown_expression
    23 value_expression: postfix_value_expression
    24 primary_type_expression: INT
    25                        | '(' type_expression ')'
    26 postfix_known_type_expression: primary_type_expression
    27                              | postfix_type_expression '*'
    28 postfix_type_expression: postfix_known_type_expression
    29                        | unknown_expression
    30 type_expression: postfix_type_expression
    

这篇关于野牛减少/减少情况的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆