ANTLR4 line comments and text parsing issue

Views: 81. Published: 2021/4/7 20:28:56. Tag: antlr4


Problem description


I'm writing a parser for C++-header-style files and am having trouble handling line comments correctly.

CustomLexer.g4

lexer grammar CustomLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
UNDEF           : '#undef';
IF              : '#if';
ELIF            : '#elif';
ELSE            : '#else';
IFDEF           : '#ifdef';
IFNDEF          : '#ifndef';
ENDIF           : '#endif';
ENABLED         : 'ENABLED';
DISABLED        : 'DISABLED';
EITHER          : 'EITHER';
ANY             : 'ANY';
DEFINED         : 'defined';
BOTH            : 'BOTH';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
HEXADECIMAL     : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX  : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}';
OPAREN          : '(';
CPAREN          : ')';
OBRACE          : '{';
CBRACE          : '}';
ADD             : '+';
SUBTRACT        : '-';
MULTIPLY        : '*';
DIVIDE          : '/';
MODULUS         : '%';
OR              : '||';
AND             : '&&';
EQUALS          : '==';
NEQUALS         : '!=';
GTEQUALS        : '>=';
LTEQUALS        : '<=';
GT              : '>';
LT              : '<';
EXCL            : '!';
QMARK           : '?';
COLON           : ':';
COMA            : ',';
OTHER           : .;

fragment Int    : [0-9] Digit* | '0';
fragment Digit  : [0-9];

mode COMMENT_MODE;
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_IF         : '#if' -> type(IF), popMode;
  COMMENT_MODE_ENDIF      : '#endif' -> type(ENDIF), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
  
  COMMENT_MODE_PART       : ~[\r\n];

CustomParser.g4:

parser grammar CustomParser;

options { tokenVocab=CustomLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | comment? undefDirective
 | comment? ifDirective
 | comment? ifdefDirective
 | comment? ifndefDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

subDirectives
 : ifDirective+
 | ifdefDirective+
 | ifndefDirective+
 | defineDirective+
 | undefDirective+
 | comment+
 ;

ifdefDirective
 : IFDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifndefDirective
 : IFNDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifDirective
 : ifStatement elseIfStatement* elseStatement? ENDIF
 ;

ifStatement
 : IF expression (subDirectives)*
 ;

elseIfStatement
 : ELIF expression (subDirectives)*
 ;

elseStatement
 : ELSE (subDirectives)*
 ;

defineDirective
 : BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
 ;

undefDirective
 : BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment+
 ;

expression
 : simpleExpression
 | customExpression
 | enabledExpression
 | disabledExpression
 | bothExpression
 | eitherExpression
 | anyExpression
 | definedExpression
 | comparisonExpression
 | arithmeticExpression
 ;

arithmeticExpression
 : arithmeticExpression  (MULTIPLY | DIVIDE) arithmeticExpression
 | arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
 | OPAREN arithmeticExpression CPAREN
 | expressionIdentifier
 ;

comparisonExpression
 : comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
 | comparisonExpression (AND | OR) comparisonExpression
 | EXCL? OPAREN comparisonExpression CPAREN
 | eitherExpression
 | enabledExpression
 | bothExpression
 | anyExpression
 | definedExpression
 | disabledExpression
 | customExpression
 | simpleExpression
 | expressionIdentifier
 ;

enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;

identifiers
 : IDENTIFIER COMA?
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

info_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;

It works fine for 95% of the directives and comments in my header file, but a few scenarios are still not handled correctly:

1. Line comments

Input:

//1
//#define ID1 //2

This is the resulting parse tree (its line numbers are referenced below):

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

I want the node on line 07 to become part of the node on line 09, that is, the second "//" should be resolved as the COMMENT_START token of the defineDirective.

2. Define directive with text

Other define rules are working correctly but:

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

Parsing these "define" directives raises an exception.

I would appreciate any help in resolving these two problems, or any recommendations on how my lexer/parser can be optimized.

Thanks in advance!

================================= Update ===================================

First test case:

Input:

//1
//#define ID1 //2

Current result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

Expected result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.    defineDirective:8
08.      COMMENT_START: "//"  
09.      DEFINE: "#define"
10.      IDENTIFIER: "ID1"
11.      info_comment
12.        COMMENT_START: "//"
13.        COMMENT_MODE_PART: "2"
14.<EOF>

Second test case:

Input:

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

Current result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \""
07.  IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>

Expected result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>

In the expected result, STRING holds the whole replacement text. I do not really know whether it is better to extend the STRING lexer token definition or to introduce a new parser rule to cover this case.

Solution

Combining this post, your previous question and Bart's answer, I assume that a define directive has the form

optional_// #define IDENTIFIER replacement_value optional_line_comment

and I use the following input file, input.txt:

/**
 * BLOCK COMMENT
 */
#pragma once
//#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

// Line 1
// Line 2
//

//======================= this is a section ======================
// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

//1
//#define ID1 //2

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

If I have understood your two questions correctly, the grammar must produce one statement for each directive and one for each comment that is not followed by a directive. A directive can be preceded by a comment, which then becomes part of the statement. A directive can also be commented out and followed by an inline line comment (that is, a comment on the same line).
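The grouping described above can be pictured with a small plain-Java model. This is only a sketch, not what ANTLR does internally: the names StatementGroupingSketch, Kind and group are hypothetical, and a commented-out directive such as //#define ID1 is assumed to count as a DIRECTIVE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: models "comment? directive | comment" on a list of already classified items. */
final class StatementGroupingSketch {

    enum Kind { COMMENT, DIRECTIVE }

    /** A single comment directly followed by a directive joins that directive's statement. */
    static List<List<Kind>> group(List<Kind> items) {
        List<List<Kind>> statements = new ArrayList<>();
        int i = 0;
        while (i < items.size()) {
            List<Kind> statement = new ArrayList<>();
            boolean commentBeforeDirective = items.get(i) == Kind.COMMENT
                    && i + 1 < items.size()
                    && items.get(i + 1) == Kind.DIRECTIVE;
            if (commentBeforeDirective) {
                statement.add(items.get(i++)); // the leading comment becomes part of the statement
            }
            statement.add(items.get(i++));     // the directive, or a comment standing alone
            statements.add(statement);
        }
        return statements;
    }

    public static void main(String[] args) {
        // Mirrors "// Line 2", "// Line 3", "#define IDENTIFIER_TWO ...", "// Line 6" in spirit.
        List<Kind> items = Arrays.asList(Kind.COMMENT, Kind.COMMENT, Kind.DIRECTIVE, Kind.COMMENT);
        System.out.println(group(items));
    }
}
```

Only the comment immediately before a directive is attached, which matches the statement rule in the grammar (comment? define_directive): earlier comments remain statements of their own.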

Grammar Header.g4 (without trace) :

grammar Header;

compilationUnit
    @init {System.out.println("Last update 1253");}
    :   ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
        )* EOF
    ;

statement
    :   comment? pragma_directive
    |   comment? define_directive
    |   section
    |   comment
    ;

pragma_directive
     :   PRAGMA char_sequence
     ;

define_directive
    :   define_identifier replacement_comment[$define_identifier.statement_line]
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [int statement_line]
    :   anything+ line_comment?
    |   {getCurrentToken().getLine() == $statement_line}? line_comment
    |   {getCurrentToken().getLine() != $statement_line}?
    ;

section
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : . ;

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4 
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@84,315:321='#define',<'#define'>,19:0]
[@85,322:322=' ',<WS>,channel=1,19:7]
[@86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[@87,341:343='   ',<WS>,channel=1,19:26]
[@88,344:344='[',<OTHER>,19:29]
[@89,345:345=' ',<WS>,channel=1,19:30]
[@90,346:346='1',<NUMBER>,19:31]
[@91,347:347=',',<OTHER>,19:32]
...
[@139,644:668='//=======================',<SEPARATOR>,34:0]
[@140,669:669=' ',<WS>,channel=1,34:25]
[@141,670:673='this',<IDENTIFIER>,34:26]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
 * BLOCK COMMENT
 */
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`

Grammar Header_trace.g4 (with trace) :

grammar Header_trace;

compilationUnit
    @init {System.out.println("Last update 1137");}
    :   statement[this.getRuleNames() /* parser rule names */]* EOF
    ;

statement [String[] rule_names]
    locals [String rule_name, int start_line, int end_line]
    @after { System.out.print("The next statement is a " + $rule_name);
             $start_line = $start.getLine();
             $end_line   = $stop.getLine();
             if ($start_line == $end_line)
                 System.out.print(" on line " + $start_line);
             else
                 System.out.print(" on lines " + $start_line + " to " + $end_line);
             System.out.println(" : ");
             System.out.println("`" + $text + "`");
           }
    :   comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
    |   comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
    |   section [rule_names]                   {$rule_name = $section.rule_name;}
    |   comment_only [rule_names]              {$rule_name = $comment_only.rule_name;}
     // comment_only can be replaced by comment when the trace is removed
    ;

pragma_directive [String[] rule_names] returns [String rule_name]
     :   PRAGMA char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
     ;

define_directive [String[] rule_names] returns [String rule_name]
    locals [String dir_rule_name, int statement_line = 0]
    @init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
    :   define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
            { $rule_name = $replacement_comment.rule_name; }
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
    :   any+=anything+ line_comment?
            { $rule_name = $dir_rule_name + " with replacement value";
              System.out.print("          anything matched : " );
              if ($any.size() > 0)
                  for (AnythingContext r : $any)
                      System.out.print(r.getText());
              else
                  System.out.print("(nothing)");

              System.out.println();
            }
    |   {getCurrentToken().getLine() == $statement_line}?
        line_comment
            { $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
    |   {getCurrentToken().getLine() != $statement_line}?
            { $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
    ;

section [String[] rule_names] returns [String rule_name]
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment_only [String[] rule_names] returns [String rule_name]
    :   comment
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : .;

Execution :

$ a4 Header_trace.g4 
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 : 
`/**
 * BLOCK COMMENT
 */
#pragma once`
...
          anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 : 
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 : 
`//================================================================`
...
The next statement is a comment_only on line 31 : 
`// Line 2`
The next statement is a comment_only on line 32 : 
`//`
...
          anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 : 
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 : 
`//1
//#define ID1 //2`
          anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 : 
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...
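In the trace above, the "anything matched" line prints "Preheat for "PREHEAT_1_LABEL without the space, while the statement text keeps it. The likely reason is that WS tokens sit on the hidden channel, so joining the texts of the matched tokens drops them, whereas taking the text between the first and last token keeps them. The plain-Java sketch below only simulates that difference; Tok, visibleTokens, joinTokenTexts and sliceSource are hypothetical names, not ANTLR API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: why joined token texts lose whitespace that a source slice keeps. */
final class HiddenChannelTextSketch {

    /** Hypothetical minimal token: its text plus inclusive character offsets in the source. */
    static final class Tok {
        final String text;
        final int start;
        final int stop;

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Locates each visible token text in order; the whitespace in between is simply skipped. */
    static List<Tok> visibleTokens(String source, String... texts) {
        List<Tok> tokens = new ArrayList<>();
        int from = 0;
        for (String text : texts) {
            int start = source.indexOf(text, from);
            tokens.add(new Tok(text, start, start + text.length() - 1));
            from = start + text.length();
        }
        return tokens;
    }

    /** Joins the token texts only: whitespace between tokens is lost. */
    static String joinTokenTexts(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    /** Slices the source from the first token's start to the last token's stop: whitespace is kept. */
    static String sliceSource(String source, List<Tok> tokens) {
        return source.substring(tokens.get(0).start, tokens.get(tokens.size() - 1).stop + 1);
    }

    public static void main(String[] args) {
        String source = "#define USER_DESC_2 \"Preheat for \" PREHEAT_1_LABEL";
        List<Tok> value = visibleTokens(source, "\"Preheat for \"", "PREHEAT_1_LABEL");
        System.out.println(joinTokenTexts(value));
        System.out.println(sliceSource(source, value));
    }
}
```

If the replacement value is needed with its original spacing, as in the expected result of the second test case, the slice is the variant to aim for. In a generated parser, I believe asking the token stream for the text between the first and last token of the rule plays that role, but treat this as an assumption to check against the runtime documentation.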

As it turned out, putting LINE_COMMENT_DELIMITER? at the beginning of the define directive rule (as you did with COMMENT_START?) is enough: because no token needs special treatment after //, it is no longer necessary to switch to mode COMMENT_MODE when a line comment delimiter is encountered.

My first attempt at the define directive rule ran into one difficulty:

define_directive
    locals [int statement_line = 0]
    :   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
    |   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
        IDENTIFIER same_line_line_comment[$statement_line]
    |   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER
    ;

same_line_line_comment [int statement_line]
    :   {getCurrentToken().getLine() == $statement_line}?
        line_comment
    ;

The following lines

// Line 6
#define IDENTIFIER_THREE

//1

were parsed with the second alternative instead of the third :

compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`

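Although the predicate guarding the subrule same_line_line_comment evaluated to false, it had no effect on the choice of alternative: the second alternative was still selected, the resulting FailedPredicateException was undesirable and the trace message was wrong. It may have to do with the rules described under "Finding Visible Predicates" in the ANTLR documentation: as far as I understand them, a predicate that comes after a token reference or an action is not evaluated during prediction.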

The solution was to split the processing of the #define directive into a fixed part, the define_identifier rule, and a variable part, the replacement_comment rule, which carries the semantic predicates (to take part in the parsing decision, a predicate must be placed at the beginning of its alternative).
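The decision made at the start of replacement_comment can be modelled in plain Java as follows. This is a simplified sketch under stated assumptions, not ANTLR's prediction algorithm: ReplacementCommentSketch, Alt and choose are hypothetical names, token types are passed as strings, and only one token of lookahead is considered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustration only: which alternative of replacement_comment applies for a given next token. */
final class ReplacementCommentSketch {

    enum Alt { REPLACEMENT_VALUE, INLINE_COMMENT, NOTHING, NO_VIABLE_ALTERNATIVE }

    /** Token types accepted by the rule "anything" in Header.g4. */
    private static final Set<String> ANYTHING = new HashSet<>(
            Arrays.asList("IDENTIFIER", "CHAR_SEQUENCE", "STRING", "NUMBER", "OTHER"));

    /**
     * @param nextType      type name of the token following the defined IDENTIFIER
     * @param nextLine      line of that token
     * @param statementLine line recorded by define_identifier
     */
    static Alt choose(String nextType, int nextLine, int statementLine) {
        // Alternative 1: anything+ line_comment?  (no predicate, so the line is not checked)
        if (ANYTHING.contains(nextType)) {
            return Alt.REPLACEMENT_VALUE;
        }
        // Alternative 2: {same line}? line_comment
        if (nextType.equals("LINE_COMMENT_DELIMITER") && nextLine == statementLine) {
            return Alt.INLINE_COMMENT;
        }
        // Alternative 3: {different line}? followed by nothing
        if (nextLine != statementLine) {
            return Alt.NOTHING;
        }
        return Alt.NO_VIABLE_ALTERNATIVE;
    }

    public static void main(String[] args) {
        // "#define IDENTIFIER_THREE" on line 42, next token "//" on line 44
        System.out.println(choose("LINE_COMMENT_DELIMITER", 44, 42));
        // "//#define ID1 //2": the trailing "//" is on the same line 45
        System.out.println(choose("LINE_COMMENT_DELIMITER", 45, 45));
        // "#define DEFAULT_A 10.0" on line 20
        System.out.println(choose("NUMBER", 20, 20));
    }
}
```

Because both predicates sit at the left edge of their alternatives, the line comparison takes part in the choice itself, which is what the first attempt could not achieve.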
