为什么antlr4将我的句子解析为两个语句? [英] Why is antlr4 parsing my sentence into two statements?

查看:109
本文介绍了为什么antlr4将我的句子解析为两个语句?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为表达式编写一个小解析器.目前,我只希望它能够识别二进制乘法( myId * myId )和类C的取消引用指针( * myId ),以及一些赋值语句( myId * = myId ).

使解析器抛出错误的输入是:

x *= y;

...,解析器在此失败并显示此消息并解析树:

[line 1:1 mismatched input ' *' expecting {';', NEWLINE}]
(sourceFile (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression x))))  * =  ) (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression y)))) ;) <EOF>)

我已经挠了一段时间,但是我看不出语法有什么问题(请参阅下面的内容).有什么提示吗?预先感谢.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

ASSIGNMENT_OP:
    '='
    ;

BINARY_OP:
    '*'
    ;

WS_BUT_NOT_NEWLINE:
    WhiteSpaceButNotNewLineCharacter
    ;

NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WSA_BINARY_OP:
    (WS_BUT_NOT_NEWLINE+ BINARY_OP WS_BUT_NOT_NEWLINE+)
    | BINARY_OP
    ;

WSA_PREFIX_OP:
    (WS_BUT_NOT_NEWLINE+ '*' )
    ;

WS  :  WhiteSpaceCharacter+ -> skip
    ;

IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

// NOTE: a file with zero statements is allowed because
// it can contain just comments.
sourceFile:
    statement* EOF;

statement:
    expressionStatement (';' | NEWLINE);

// Req. not existing any valid expression starting from
// an equals sign or any other assignment operator.
expressionStatement:
    expression (assignmentOperator expression)?;

expression:
    monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

monoOperatedExpression:
    atomicExpression
    ;

binaryOperator:
    WSA_BINARY_OP
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type:
    IDENTIFIER
    ;

assignmentOperator:
    ASSIGNMENT_OP
    ;

fragment DecimalDigit:
    '0'..'9'
    ;

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | IdentifierHead
    ;

fragment WhiteSpaceCharacter:
    WhiteSpaceButNotNewLineCharacter
    | NewLineCharacter;

fragment WhiteSpaceButNotNewLineCharacter:
    [\u0020\u000C\u0009u000B\u000C]
    ;

fragment NewLineCharacter:
    [\u000A\u000D]
    ;

编辑:应评论员的要求添加语法的新版本.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

//
// PARSER RULES.

sourceFile              : statement* EOF;
statement               : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement     : expression (ASSIGNMENT_OPERATOR expression)?;

expression:
    expression WSA_OPERATOR expression
    | expression OPERATOR expression
    | OPERATOR expression
    | expression OPERATOR
    | atomicExpression
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type                    : IDENTIFIER;


//
// LEXER RULES.

COMMENT                 : '/*' .*? '*/'                    -> channel(HIDDEN);
LINE_COMMENT            : '//' ~[\000A\000D]*              -> channel(HIDDEN);

ASSIGNMENT_OPERATOR     : Operator? '=';

// WSA = White Space Aware token.
// These are tokens that occurr in a given whitespace context.
WSA_OPERATOR:
    (WhiteSpaceNotNewline+ Operator WhiteSpaceNotNewline+)
    ;

OPERATOR         : Operator;

// Newline chars are defined apart because they carry meaning as a statement
// delimiter.
NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WS                      : WhiteSpaceNotNewline -> skip;

SEMICOLON               : ';';


IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

fragment DecimalDigit   :'0'..'9';

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    | '_'
    | '\u00A8'
    | '\u00AA'
    | '\u00AD'
    | '\u00AF' |
    '\u00B2'..'\u00B5' |
    '\u00B7'..'\u00BA'  |
    '\u00BC'..'\u00BE' |
    '\u00C0'..'\u00D6' |
    '\u00D8'..'\u00F6' |
    '\u00F8'..'\u00FF' |
    '\u0100'..'\u02FF' |
    '\u0370'..'\u167F' |
    '\u1681'..'\u180D' |
    '\u180F'..'\u1DBF' |
    '\u1E00'..'\u1FFF' |
    '\u200B'..'\u200D' |
    '\u202A'..'\u202E' |
    '\u203F'..'\u2040' |
    '\u2054' |
    '\u2060'..'\u206F' |
    '\u2070'..'\u20CF' |
    '\u2100'..'\u218F' |
    '\u2460'..'\u24FF' |
    '\u2776'..'\u2793' |
    '\u2C00'..'\u2DFF' |
    '\u2E80'..'\u2FFF' |
    '\u3004'..'\u3007' |
    '\u3021'..'\u302F' |
    '\u3031'..'\u303F' |
    '\u3040'..'\uD7FF' |
    '\uF900'..'\uFD3D' |
    '\uFD40'..'\uFDCF' |
    '\uFDF0'..'\uFE1F' |
    '\uFE30'..'\uFE44' |
    '\uFE47'..'\uFFFD'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | '\u0300'..'\u036F'
    | '\u1DC0'..'\u1DFF'
    | '\u20D0'..'\u20FF'
    | '\uFE20'..'\uFE2F'
    | IdentifierHead
    ;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline    : [\u0020\u000C\u0009u000B\u000C];

fragment Operator:
    '*'
    | '/'
    | '%'
    | '+'
    | '-'
    | '<<'
    | '>>'
    | '&'
    | '^'
    | '|'
    ;

解决方案

规则

expression
    : monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

不允许在binaryOperator之后的=.因此,运行时报告说,在使用BINARY_OP之后,它不知道下一步要使用什么规则.

可以通过一些重要的重组来固定语法,最好是简化.

1-通过忽略空格/换行符,可以大大简化其处理.

WS : [ \t\r\n] -> skip;

C系列和类Python语言是上下文无关的语言,具有一些众所周知的上下文相关的极端情况. ANTLR是上下文无关的解析器,具有许多便利功能来处理上下文敏感度.因此,忽略(或隐藏)空格应该是默认值.

2-按定义消除对*的使用:

STAR_EQUAL : '*=' ;
STAR       : '*'  ;
EQUAL      : '='  ;

这可确保任何单个STAR都可被视为仅用作指针标记或乘法运算符(序列STAR WS EQUAL在您的语言中无效或可能具有某些自定义含义).

3-使用解析器规则递归:考虑 解决方案

The rule

expression
    : monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

does not permit an = after the binaryOperator. Accordingly, the runtime reports that it did not know what next rule to use following the consumption of the BINARY_OP.

The grammar can be fixed with some significant restructuring and, preferably, simplification.

1 - Whitespace/newline handling can be greatly simplified by ignoring it.

WS : [ \t\r\n] -> skip;

C-family and Python-like languages are context free languages with a few, well-known context sensitive corner cases. ANTLR is a context free parser with a number of convenience capabilities to handle context sensitivities. So, ignoring (or hiding) whitespace should be the default.

2 - disambiguate the use of * by definition:

STAR_EQUAL : '*=' ;
STAR       : '*'  ;
EQUAL      : '='  ;

This ensures that any single STAR is available to be considered only as a pointer mark or multiplication operator (the sequence STAR WS EQUAL is either invalid in your language or could have some custom meaning).

3 - use parser rule recursion: consider the expression handling rules in the C grammar, specifically starting with the expression rule. The simplified pattern is:

expression     // list of all valid syntaxes for an `expression`
    : LPAREN expression RPAREN
    | expression ( COMMA expression )*
    | expression op expression 
    | unitary_op expression 
    | expression unitary_op 
    | << any other valid syntax >>
    | atom
    ;

 unitary_op : 2PLUS | 2DASH | .... ;
 op         : STAR_EQUAL | STAR | EQUAL | .... ;

 atom
    : STAR? IDENTIFIER   // pointer usage
    | NUMBER
    ;

Presented this way, the grammar will be far more readable and maintainable.

With these changes, completing the revision of the grammar is left as an easy exercise for the OP (meaning, try it and post any problems encountered).

Bonus - ANTLR is a top down parser. So, put the parser rules at the top, organized broad to narrow. Followed by the lexer rules, also ordered in the same way, any lexer modes, and then with fragment rules at the very bottom.

This ordering ease the cognitive load of understanding the grammar by you and others. For example, makes the tree dump easier/quicker to understand. Will also ease the task of eventually dividing into a split grammar (recommended if the grammar is of any significant complexity and required if there are modes).

Full Grammar

grammar Sable;

@header {
    package org.sable.parser.gen;
}

sable
    : statement* EOF
    ;

statement
    : expression? SEMI
    ;

expression
    : LPAREN expression RPAREN
    | COMMA expression
    | expression op expression
    | unitary_op expression
    | expression unitary_op
    | STAR? IDENTIFIER
    | NUMBER
    ;

 unitary_op
    : DPLUS | DMINUS
    ;

 op : STAR_EQUAL | DIV_EQUAL | PLUS_EQUAL | MINUS_EQUAL | EQUAL
    | STAR | DIV | PLUS | MINUS
    ;


COMMENT     : Comment -> skip ;

STAR_EQUAL  : '*=' ;
DIV_EQUAL   : '/=' ;
PLUS_EQUAL  : '+=' ;
MINUS_EQUAL : '-=' ;
EQUAL       : '='  ;

STAR        : '*'  ; // mult or pointer
DIV         : '/'  ;
PLUS        : '+'  ;
MINUS       : '-'  ;

DPLUS       : '++' ;
DMINUS      : '--' ;

COMMA       : ','  ;
DOT         : '.'  ;
SEMI        : ';'  ;

LPAREN      : '('  ;
RPAREN      : ')'  ;
LBRACE      : '{'  ;
RBRACE      : '}'  ;
LBRACK      : '['  ;
RBRACK      : ']'  ;
LANGLE      : '<'  ;
RANGLE      : '>'  ;

NUMBER      : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)? ;
IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_-]*  ;

WS          : [ \t\r\n]+    -> skip;

ERRCHAR
    :   .   -> channel(HIDDEN)
    ;

fragment Comment
    :   '/*' .*? '*/'
    |   '//' ~[\r\n]*
    ;

Generates but is untested. Report back any corner cases not handled.

这篇关于为什么antlr4将我的句子解析为两个语句?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆