Why is antlr4 parsing my sentence into two statements?


Question

I am writing a little parser for expressions. At the moment I just want it to recognize binary multiplications (myId * myId) and C-like dereferenced pointers (*myId), plus some assignment statements (myId *= myId).

The input that makes the parser throw errors is:

x *= y;

... on which the parser fails with this message and parse tree:

[line 1:1 mismatched input ' *' expecting {';', NEWLINE}]
(sourceFile (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression x))))  * =  ) (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression y)))) ;) <EOF>)

I've been scratching my head for a while but I can't see what is wrong in my grammar (see it below). Any hints, please? Thanks in advance.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

ASSIGNMENT_OP:
    '='
    ;

BINARY_OP:
    '*'
    ;

WS_BUT_NOT_NEWLINE:
    WhiteSpaceButNotNewLineCharacter
    ;

NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WSA_BINARY_OP:
    (WS_BUT_NOT_NEWLINE+ BINARY_OP WS_BUT_NOT_NEWLINE+)
    | BINARY_OP
    ;

WSA_PREFIX_OP:
    (WS_BUT_NOT_NEWLINE+ '*' )
    ;

WS  :  WhiteSpaceCharacter+ -> skip
    ;

IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

// NOTE: a file with zero statements is allowed because
// it can contain just comments.
sourceFile:
    statement* EOF;

statement:
    expressionStatement (';' | NEWLINE);

// Req. not existing any valid expression starting from
// an equals sign or any other assignment operator.
expressionStatement:
    expression (assignmentOperator expression)?;

expression:
    monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

monoOperatedExpression:
    atomicExpression
    ;

binaryOperator:
    WSA_BINARY_OP
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type:
    IDENTIFIER
    ;

assignmentOperator:
    ASSIGNMENT_OP
    ;

fragment DecimalDigit:
    '0'..'9'
    ;

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | IdentifierHead
    ;

fragment WhiteSpaceCharacter:
    WhiteSpaceButNotNewLineCharacter
    | NewLineCharacter;

fragment WhiteSpaceButNotNewLineCharacter:
    [\u0020\u0009\u000B\u000C]
    ;

fragment NewLineCharacter:
    [\u000A\u000D]
    ;

Adding a new version of the grammar at the request of commenters.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

//
// PARSER RULES.

sourceFile              : statement* EOF;
statement               : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement     : expression (ASSIGNMENT_OPERATOR expression)?;

expression:
    expression WSA_OPERATOR expression
    | expression OPERATOR expression
    | OPERATOR expression
    | expression OPERATOR
    | atomicExpression
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type                    : IDENTIFIER;


//
// LEXER RULES.

COMMENT                 : '/*' .*? '*/'                    -> channel(HIDDEN);
LINE_COMMENT            : '//' ~[\u000A\u000D]*            -> channel(HIDDEN);

ASSIGNMENT_OPERATOR     : Operator? '=';

// WSA = White Space Aware token.
// These are tokens that occur in a given whitespace context.
WSA_OPERATOR:
    (WhiteSpaceNotNewline+ Operator WhiteSpaceNotNewline+)
    ;

OPERATOR         : Operator;

// Newline chars are defined apart because they carry meaning as a statement
// delimiter.
NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WS                      : WhiteSpaceNotNewline -> skip;

SEMICOLON               : ';';


IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

fragment DecimalDigit   :'0'..'9';

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    | '_'
    | '\u00A8'
    | '\u00AA'
    | '\u00AD'
    | '\u00AF' |
    '\u00B2'..'\u00B5' |
    '\u00B7'..'\u00BA'  |
    '\u00BC'..'\u00BE' |
    '\u00C0'..'\u00D6' |
    '\u00D8'..'\u00F6' |
    '\u00F8'..'\u00FF' |
    '\u0100'..'\u02FF' |
    '\u0370'..'\u167F' |
    '\u1681'..'\u180D' |
    '\u180F'..'\u1DBF' |
    '\u1E00'..'\u1FFF' |
    '\u200B'..'\u200D' |
    '\u202A'..'\u202E' |
    '\u203F'..'\u2040' |
    '\u2054' |
    '\u2060'..'\u206F' |
    '\u2070'..'\u20CF' |
    '\u2100'..'\u218F' |
    '\u2460'..'\u24FF' |
    '\u2776'..'\u2793' |
    '\u2C00'..'\u2DFF' |
    '\u2E80'..'\u2FFF' |
    '\u3004'..'\u3007' |
    '\u3021'..'\u302F' |
    '\u3031'..'\u303F' |
    '\u3040'..'\uD7FF' |
    '\uF900'..'\uFD3D' |
    '\uFD40'..'\uFDCF' |
    '\uFDF0'..'\uFE1F' |
    '\uFE30'..'\uFE44' |
    '\uFE47'..'\uFFFD'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | '\u0300'..'\u036F'
    | '\u1DC0'..'\u1DFF'
    | '\u20D0'..'\u20FF'
    | '\uFE20'..'\uFE2F'
    | IdentifierHead
    ;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline    : [\u0020\u0009\u000B\u000C];

fragment Operator:
    '*'
    | '/'
    | '%'
    | '+'
    | '-'
    | '<<'
    | '>>'
    | '&'
    | '^'
    | '|'
    ;

Recommended answer

The rule

expression
    : monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

does not permit an = after the binaryOperator. Accordingly, the runtime reports that it did not know what next rule to use following the consumption of the BINARY_OP.
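
To see which tokens the parser actually receives for this input, the lexer rules of the first grammar can be simulated by hand. The sketch below is a simplified stand-in for ANTLR's lexer, not ANTLR itself. It assumes the usual longest-match behaviour, with ties going to the rule defined first; it treats ';' as an explicit SEMI rule (an assumption, since the grammar only uses it as a literal); and it reads the whitespace class as space, tab, vertical tab and form feed. The names RULES and tokenize are made up for the illustration.

```python
import re

# Whitespace-but-not-newline class, as the question's grammar appears to intend it.
WSNN = r"[ \t\x0b\x0c]"

# (name, pattern, skipped?) in the order the rules appear in the first grammar.
# SEMI is an assumption: the grammar only uses ';' as an implicit literal.
RULES = [
    ("SEMI", r";", False),
    ("ASSIGNMENT_OP", r"=", False),
    ("BINARY_OP", r"\*", False),
    ("WS_BUT_NOT_NEWLINE", WSNN, False),
    ("NEWLINE", r"\r\n|\n", False),
    ("WSA_BINARY_OP", WSNN + r"+\*" + WSNN + r"+|\*", False),
    ("WSA_PREFIX_OP", WSNN + r"+\*", False),
    ("WS", r"[ \t\x0b\x0c\r\n]+", True),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*|`[A-Za-z][A-Za-z0-9]*`", False),
]
COMPILED = [(name, re.compile(pattern), skip) for name, pattern, skip in RULES]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern, skip in COMPILED:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), skip)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end, skip = best
        if not skip:
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens


for token in tokenize("x *= y;"):
    print(token)
```

Under these assumptions the space and the star are grouped into a single WSA_PREFIX_OP token (' *'), followed by ASSIGNMENT_OP, which matches the ' *' named in the error message. The space before y also comes out as a WS_BUT_NOT_NEWLINE token rather than being skipped, because that rule is defined before WS and matches the same length.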

The grammar can be fixed with some significant restructuring and, preferably, simplification.

1 - Whitespace/newline handling can be greatly simplified by ignoring it.

WS : [ \t\r\n] -> skip;

C-family and Python-like languages are context free languages with a few, well-known context sensitive corner cases. ANTLR is a context free parser with a number of convenience capabilities to handle context sensitivities. So, ignoring (or hiding) whitespace should be the default.

2 - Disambiguate the use of * by definition:

STAR_EQUAL : '*=' ;
STAR       : '*'  ;
EQUAL      : '='  ;

This ensures that any single STAR is available to be considered only as a pointer mark or multiplication operator (the sequence STAR WS EQUAL is either invalid in your language or could have some custom meaning).
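
The effect of defining STAR_EQUAL as its own token can be illustrated with a small longest-match tokenizer. This is only a sketch of the matching strategy, assuming longest match with ties going to the first rule listed; it is not ANTLR's implementation. The SEMI and IDENTIFIER rules and the lex helper are added here for illustration.

```python
import re

# Token rules in definition order; WS is matched but dropped, like "-> skip".
RULES = [
    ("STAR_EQUAL", re.compile(r"\*=")),
    ("STAR", re.compile(r"\*")),
    ("EQUAL", re.compile(r"=")),
    ("SEMI", re.compile(r";")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def lex(text):
    """Return the token names for text, choosing the longest match at each position."""
    names, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m:
                candidates.append((m.end() - pos, name))
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        # max() returns the first maximal element, so earlier rules win ties.
        length, name = max(candidates, key=lambda c: c[0])
        if name != "WS":
            names.append(name)
        pos += length
    return names


print(lex("x *= y;"))   # ['IDENTIFIER', 'STAR_EQUAL', 'IDENTIFIER', 'SEMI']
print(lex("x * = y;"))  # ['IDENTIFIER', 'STAR', 'EQUAL', 'IDENTIFIER', 'SEMI']
print(lex("*p;"))       # ['STAR', 'IDENTIFIER', 'SEMI']
```

Because whitespace never becomes part of an operator token, `*=` is always one token, and a lone `*` is left for the parser to interpret as either a multiplication or a pointer mark.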

3 - Use parser rule recursion: consider the expression-handling rules in the C grammar, starting with the expression rule. The simplified pattern is:

expression     // list of all valid syntaxes for an `expression`
    : LPAREN expression RPAREN
    | expression ( COMMA expression )*
    | expression op expression 
    | unitary_op expression 
    | expression unitary_op 
    | << any other valid syntax >>
    | atom
    ;

 unitary_op : DPLUS | DMINUS | .... ;
 op         : STAR_EQUAL | STAR | EQUAL | .... ;

 atom
    : STAR? IDENTIFIER   // pointer usage
    | NUMBER
    ;

Presented this way, the grammar will be far more readable and maintainable.
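
For intuition about what such a recursive expression rule produces, here is a hand-written analogue in Python. It is a sketch under stated assumptions, not what ANTLR generates: the precedence table (assignment operators lowest and right-associative, binary * higher and left-associative) is assumed from C-like conventions, only a handful of tokens are supported, and the names tokenize, parse and BINARY are hypothetical.

```python
import re

# One token per match; '*=' is listed before '*' and '=' so it is tried first.
TOKEN = re.compile(r"\s*(\*=|=|\*|\(|\)|[A-Za-z_][A-Za-z0-9_]*)")

# Assumed precedence and associativity, following C-like conventions.
BINARY = {"=": (1, "right"), "*=": (1, "right"), "*": (2, "left")}


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = advance()
        if tok == "(":                      # LPAREN expression RPAREN
            node = expression(1)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok == "*":                      # STAR? IDENTIFIER (pointer usage)
            return ("deref", atom())
        if tok in BINARY or tok == ")":
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok                          # plain identifier

    def expression(min_prec):
        left = atom()
        # expression op expression, resolved by precedence climbing
        while peek() in BINARY and BINARY[peek()][0] >= min_prec:
            op = advance()
            prec, assoc = BINARY[op]
            right = expression(prec + 1 if assoc == "left" else prec)
            left = (op, left, right)
        return left

    tree = expression(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse(tokenize("x *= y")))       # ('*=', 'x', 'y')
print(parse(tokenize("x *= *p * y")))  # ('*=', 'x', ('*', ('deref', 'p'), 'y'))
```

The whole input ends up as a single expression tree, rather than being split at the operator as in the original parse.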

With these changes, completing the revision of the grammar is left as an easy exercise for the OP (meaning, try it and post any problems encountered).

Bonus - ANTLR is a top-down parser, so put the parser rules at the top, organized from broad to narrow. Follow them with the lexer rules, ordered in the same way, then any lexer modes, and finally the fragment rules at the very bottom.

This ordering eases the cognitive load of understanding the grammar, both for you and for others. For example, it makes the tree dump easier and quicker to understand. It will also ease the task of eventually dividing the grammar into a split grammar (recommended if the grammar is of any significant complexity, and required if there are modes).

Complete grammar

grammar Sable;

@header {
    package org.sable.parser.gen;
}

sable
    : statement* EOF
    ;

statement
    : expression? SEMI
    ;

expression
    : LPAREN expression RPAREN
    | COMMA expression
    | expression op expression
    | unitary_op expression
    | expression unitary_op
    | STAR? IDENTIFIER
    | NUMBER
    ;

 unitary_op
    : DPLUS | DMINUS
    ;

 op : STAR_EQUAL | DIV_EQUAL | PLUS_EQUAL | MINUS_EQUAL | EQUAL
    | STAR | DIV | PLUS | MINUS
    ;


COMMENT     : Comment -> skip ;

STAR_EQUAL  : '*=' ;
DIV_EQUAL   : '/=' ;
PLUS_EQUAL  : '+=' ;
MINUS_EQUAL : '-=' ;
EQUAL       : '='  ;

STAR        : '*'  ; // mult or pointer
DIV         : '/'  ;
PLUS        : '+'  ;
MINUS       : '-'  ;

DPLUS       : '++' ;
DMINUS      : '--' ;

COMMA       : ','  ;
DOT         : '.'  ;
SEMI        : ';'  ;

LPAREN      : '('  ;
RPAREN      : ')'  ;
LBRACE      : '{'  ;
RBRACE      : '}'  ;
LBRACK      : '['  ;
RBRACK      : ']'  ;
LANGLE      : '<'  ;
RANGLE      : '>'  ;

NUMBER      : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)? ;
IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_]*  ; // no '-' here: it would swallow the MINUS in x-y

WS          : [ \t\r\n]+    -> skip;

ERRCHAR
    :   .   -> channel(HIDDEN)
    ;

fragment Comment
    :   '/*' .*? '*/'
    |   '//' ~[\r\n]*
    ;

Generates but is untested. Report back any corner cases not handled.
