ANTLR4 line comments and text parsing issue

Views: 43 | Published: 2021/11/11 4:10:36 | Tag: antlr4


Problem description



I'm writing a parser for C++ header-style files and am having trouble handling line comments correctly.

CustomLexer.g4

lexer grammar CustomLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
UNDEF           : '#undef';
IF              : '#if';
ELIF            : '#elif';
ELSE            : '#else';
IFDEF           : '#ifdef';
IFNDEF          : '#ifndef';
ENDIF           : '#endif';
ENABLED         : 'ENABLED';
DISABLED        : 'DISABLED';
EITHER          : 'EITHER';
ANY             : 'ANY';
DEFINED         : 'defined';
BOTH            : 'BOTH';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
HEXADECIMAL     : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX  : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}';
OPAREN          : '(';
CPAREN          : ')';
OBRACE          : '{';
CBRACE          : '}';
ADD             : '+';
SUBTRACT        : '-';
MULTIPLY        : '*';
DIVIDE          : '/';
MODULUS         : '%';
OR              : '||';
AND             : '&&';
EQUALS          : '==';
NEQUALS         : '!=';
GTEQUALS        : '>=';
LTEQUALS        : '<=';
GT              : '>';
LT              : '<';
EXCL            : '!';
QMARK           : '?';
COLON           : ':';
COMA            : ',';
OTHER           : .;

fragment Int    : [0-9] Digit* | '0';
fragment Digit  : [0-9];

mode COMMENT_MODE;
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_IF         : '#if' -> type(IF), popMode;
  COMMENT_MODE_ENDIF      : '#endif' -> type(ENDIF), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
  
  COMMENT_MODE_PART       : ~[\r\n];

CustomParser.g4:

parser grammar CustomParser;

options { tokenVocab=CustomLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | comment? undefDirective
 | comment? ifDirective
 | comment? ifdefDirective
 | comment? ifndefDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

subDirectives
 : ifDirective+
 | ifdefDirective+
 | ifndefDirective+
 | defineDirective+
 | undefDirective+
 | comment+
 ;

ifdefDirective
 : IFDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifndefDirective
 : IFNDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifDirective
 : ifStatement elseIfStatement* elseStatement? ENDIF
 ;

ifStatement
 : IF expression (subDirectives)*
 ;

elseIfStatement
 : ELIF expression (subDirectives)*
 ;

elseStatement
 : ELSE (subDirectives)*
 ;

defineDirective
 : BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
 ;

undefDirective
 : BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment+
 ;

expression
 : simpleExpression
 | customExpression
 | enabledExpression
 | disabledExpression
 | bothExpression
 | eitherExpression
 | anyExpression
 | definedExpression
 | comparisonExpression
 | arithmeticExpression
 ;

arithmeticExpression
 : arithmeticExpression  (MULTIPLY | DIVIDE) arithmeticExpression
 | arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
 | OPAREN arithmeticExpression CPAREN
 | expressionIdentifier
 ;

comparisonExpression
 : comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
 | comparisonExpression (AND | OR) comparisonExpression
 | EXCL? OPAREN comparisonExpression CPAREN
 | eitherExpression
 | enabledExpression
 | bothExpression
 | anyExpression
 | definedExpression
 | disabledExpression
 | customExpression
 | simpleExpression
 | expressionIdentifier
 ;

enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;

identifiers
 : IDENTIFIER COMA?
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

info_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;

It works fine for 95% of the directives and comments in my header file, but a few scenarios are still not handled correctly:

1. Line comments

Input:

//1
//#define ID1 //2

This is the resulting parse tree:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

I want the second line_comment (line 07) to become part of the defineDirective (line 09), with its // matched as the directive's COMMENT_START token.

2. Define directive with text

The other define rules work correctly, but these do not:

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

These #define directives fail with a parsing exception.

I would appreciate any help resolving these two problems, or any recommendations on how my lexer/parser can be improved.

Thanks in advance!

================================= Update =================================

First test case:

Input:

//1
//#define ID1 //2

Current result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

Expected result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.    defineDirective:8
08.      COMMENT_START: "//"  
09.      DEFINE: "#define"
10.      IDENTIFIER: "ID1"
11.      info_comment
12.        COMMENT_START: "//"
13.        COMMENT_MODE_PART: "2"
14.<EOF>

Second test case:

Input:

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

Current result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \""
07.  IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>

Expected result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>

In the expected result, STRING represents the whole replacement text. I do not know whether it is better to extend the STRING lexer token definition or to introduce a new parser rule to cover this case.

Solution

Combining this post, your previous question and Bart's answer, and assuming that a define directive has the form

optional_// #define IDENTIFIER replacement_value optional_line_comment

and given the input file input.txt

/**
 * BLOCK COMMENT
 */
#pragma once
//#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

// Line 1
// Line 2
//

//======================= this is a section ======================
// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

//1
//#define ID1 //2

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

If I have understood your two questions correctly, the grammar must produce a statement for each directive, and for each comment that is not followed by a directive. A directive can be preceded by a comment, which becomes part of the statement. A directive can be commented out and followed by an inline line comment (that is, on the same line).
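
To make the assumed directive form concrete, here is a minimal line-based sketch in plain Java. It is not the grammar below and does not use ANTLR: it only approximates the shape "optional //, #define, identifier, optional replacement value, optional line comment" with a regular expression. The class name DefineLineSketch and the method split are hypothetical, and the sketch assumes the replacement value never contains // (for example inside a string literal).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefineLineSketch {
    // Groups: 1 = optional leading "//", 2 = identifier,
    //         3 = replacement value (may be empty), 4 = optional trailing line comment.
    // Assumption: "//" never occurs inside the replacement value.
    static final Pattern DEFINE_LINE = Pattern.compile(
        "^\\s*(//)?\\s*#define\\s+([A-Za-z_][A-Za-z_0-9]*)\\s*(.*?)\\s*(//.*)?$");

    /** Returns {commentedOut, identifier, value, comment}, or null if the line is not a #define. */
    static String[] split(String line) {
        Matcher m = DEFINE_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1) == null ? "no" : "yes",
            m.group(2),
            m.group(3),
            m.group(4) == null ? "" : m.group(4)
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "#define DEFAULT_A 10.0",
            "//#define IDENTIFIER_3 Version.h // Line 5",
            "//#define ID1 //2",
            "#define IDENTIFIER_THREE"
        };
        for (String sample : samples) {
            // Print the four parts separated by " | ".
            System.out.println(String.join(" | ", split(sample)));
        }
    }
}
```

A regular expression like this cannot attach a preceding comment line to the directive or cope with // inside strings, which is one reason to use a grammar instead.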

Grammar Header.g4 (without trace):

grammar Header;

compilationUnit
    @init {System.out.println("Last update 1253");}
    :   ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
        )* EOF
    ;

statement
    :   comment? pragma_directive
    |   comment? define_directive
    |   section
    |   comment
    ;

pragma_directive
     :   PRAGMA char_sequence
     ;

define_directive
    :   define_identifier replacement_comment[$define_identifier.statement_line]
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [int statement_line]
    :   anything+ line_comment?
    |   {getCurrentToken().getLine() == $statement_line}? line_comment
    |   {getCurrentToken().getLine() != $statement_line}?
    ;

section
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : . ;

Execution:

$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4 
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@84,315:321='#define',<'#define'>,19:0]
[@85,322:322=' ',<WS>,channel=1,19:7]
[@86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[@87,341:343='   ',<WS>,channel=1,19:26]
[@88,344:344='[',<OTHER>,19:29]
[@89,345:345=' ',<WS>,channel=1,19:30]
[@90,346:346='1',<NUMBER>,19:31]
[@91,347:347=',',<OTHER>,19:32]
...
[@139,644:668='//=======================',<SEPARATOR>,34:0]
[@140,669:669=' ',<WS>,channel=1,34:25]
[@141,670:673='this',<IDENTIFIER>,34:26]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
 * BLOCK COMMENT
 */
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`

Grammar Header_trace.g4 (with trace):

grammar Header_trace;

compilationUnit
    @init {System.out.println("Last update 1137");}
    :   statement[this.getRuleNames() /* parser rule names */]* EOF
    ;

statement [String[] rule_names]
    locals [String rule_name, int start_line, int end_line]
    @after { System.out.print("The next statement is a " + $rule_name);
             $start_line = $start.getLine();
             $end_line   = $stop.getLine();
             if ($start_line == $end_line)
                 System.out.print(" on line " + $start_line);
             else
                 System.out.print(" on lines " + $start_line + " to " + $end_line);
             System.out.println(" : ");
             System.out.println("`" + $text + "`");
           }
    :   comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
    |   comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
    |   section [rule_names]                   {$rule_name = $section.rule_name;}
    |   comment_only [rule_names]              {$rule_name = $comment_only.rule_name;}
     // comment_only can be replaced by comment when the trace is removed
    ;

pragma_directive [String[] rule_names] returns [String rule_name]
     :   PRAGMA char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
     ;

define_directive [String[] rule_names] returns [String rule_name]
    locals [String dir_rule_name, int statement_line = 0]
    @init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
    :   define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
            { $rule_name = $replacement_comment.rule_name; }
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
    :   any+=anything+ line_comment?
            { $rule_name = $dir_rule_name + " with replacement value";
              System.out.print("          anything matched : " );
              if ($any.size() > 0)
                  for (AnythingContext r : $any)
                      System.out.print(r.getText());
              else
                  System.out.print("(nothing)");

              System.out.println();
            }
    |   {getCurrentToken().getLine() == $statement_line}?
        line_comment
            { $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
    |   {getCurrentToken().getLine() != $statement_line}?
            { $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
    ;

section [String[] rule_names] returns [String rule_name]
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment_only [String[] rule_names] returns [String rule_name]
    :   comment
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : .;

Execution:

$ a4 Header_trace.g4 
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 : 
`/**
 * BLOCK COMMENT
 */
#pragma once`
...
          anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 : 
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 : 
`//================================================================`
...
The next statement is a comment_only on line 31 : 
`// Line 2`
The next statement is a comment_only on line 32 : 
`//`
...
          anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 : 
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 : 
`//1
//#define ID1 //2`
          anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 : 
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...
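
In the trace above, the "anything matched" line prints "Preheat for "PREHEAT_1_LABEL without the space that the statement text still shows, because the action concatenates the text of each matched token while whitespace sits on the hidden channel. The following plain-Java sketch (no ANTLR dependency; the class ReplacementTextSketch and both method names are hypothetical) contrasts the two ways of rebuilding the replacement text: joining token texts versus slicing the original input between the first token's start offset and the last token's inclusive stop offset, which is the kind of result $text shows in the statement output.

```java
import java.util.Arrays;
import java.util.List;

class ReplacementTextSketch {
    /** Concatenates token texts only; whitespace that was not part of a token is lost. */
    static String joined(List<String> tokenTexts) {
        return String.join("", tokenTexts);
    }

    /** Slices the original input; stopIndex is inclusive, as in the -tokens listing. */
    static String sourceText(String input, int startIndex, int stopIndex) {
        return input.substring(startIndex, stopIndex + 1);
    }

    public static void main(String[] args) {
        String input = "#define USER_DESC_2 \"Preheat for \" PREHEAT_1_LABEL";
        List<String> tokens = Arrays.asList("\"Preheat for \"", "PREHEAT_1_LABEL");

        // Token texts glued together: the space between the two tokens disappears.
        System.out.println(joined(tokens));

        // Original characters from the first value token to the last one.
        int start = input.indexOf('"');
        int stop = input.length() - 1;
        System.out.println(sourceText(input, start, stop));
    }
}
```

If the whole replacement text is wanted as one value, as in the expected result of the second test case, slicing the source keeps the original spacing without changing the STRING token definition.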

Thanks to the optional LINE_COMMENT_DELIMITER? at the beginning of the define directive rule (the same idea as your COMMENT_START?), and because no token needs special treatment after //, it is no longer necessary to switch to mode COMMENT_MODE when a line comment delimiter is encountered.

My first approach ran into one difficulty:

define_directive
    locals [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
    |   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
        IDENTIFIER same_line_line_comment[$statement_line]
    |   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER
    ;

same_line_line_comment [int statement_line]
    :   {getCurrentToken().getLine() == $statement_line}?
        line_comment
    ;

The following lines

// Line 6
#define IDENTIFIER_THREE

//1

were parsed with the second alternative instead of the third:

compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`

Even though the predicate guarding the subrule same_line_line_comment evaluated to false, it had no effect on the choice of alternative: the parser still took the second alternative, threw an unwanted FailedPredicateException, and printed a wrong trace message. This may be related to what the ANTLR documentation describes under "Finding Visible Predicates".

The solution was to split the processing of the #define directive into a fixed part, the define_identifier rule, and a variable part, the replacement_comment rule, which carries the semantic predicates. To take part in the parsing decision, a predicate must be placed at the beginning of its alternative.
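
The choice that replacement_comment has to make can be modelled in a few lines of plain Java. This is only a sketch of the intended decision, not of ANTLR's actual prediction algorithm; the class ReplacementChoiceSketch, its enums and the method choose are hypothetical names. ANYTHING stands for any token accepted by the anything rule, and SOMETHING_ELSE for every other token (another directive, a block comment, EOF).

```java
class ReplacementChoiceSketch {
    enum NextToken { ANYTHING, LINE_COMMENT_DELIMITER, SOMETHING_ELSE }
    enum Alternative { WITH_VALUE, INLINE_COMMENT_ONLY, NO_VALUE, NO_VIABLE_ALTERNATIVE }

    /**
     * statementLine: line recorded by define_identifier.
     * next, nextLine: kind and line of the token that follows the identifier.
     */
    static Alternative choose(int statementLine, NextToken next, int nextLine) {
        // First alternative: anything+ line_comment? (no predicate involved).
        if (next == NextToken.ANYTHING) {
            return Alternative.WITH_VALUE;
        }
        boolean sameLine = nextLine == statementLine;
        // Second alternative: guarded by "same line", then a line comment.
        if (sameLine && next == NextToken.LINE_COMMENT_DELIMITER) {
            return Alternative.INLINE_COMMENT_ONLY;
        }
        // Third alternative: guarded by "different line", matches nothing.
        if (!sameLine) {
            return Alternative.NO_VALUE;
        }
        // Same line, but neither a value nor a comment follows.
        return Alternative.NO_VIABLE_ALTERNATIVE;
    }

    public static void main(String[] args) {
        // "#define DEFAULT_A 10.0" on line 20, followed by a number on line 20.
        System.out.println(choose(20, NextToken.ANYTHING, 20));
        // "//#define ID1 //2" on line 45, followed by "//" on line 45.
        System.out.println(choose(45, NextToken.LINE_COMMENT_DELIMITER, 45));
        // "#define IDENTIFIER_THREE" on line 42, followed by "//" on line 44.
        System.out.println(choose(42, NextToken.LINE_COMMENT_DELIMITER, 44));
    }
}
```

The line numbers in main are taken from the trace output above. The third call is the case that the first approach got wrong.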
