ANTLR4 Parser issues

Problem description

I'm trying to write a parser for a C++-style header file, and I'm failing to configure the parser properly.

Lexer:

lexer grammar HeaderLexer;

SectionLineComment
    :   LINE_COMMENT_SIGN Section CharacterSequence
    ;

Pragma
    : POUND 'pragma'
    ;

Section
    :  AT_SIGN 'section'
    ;

Define
    : POUND 'define'
    | LINE_COMMENT_SIGN POUND 'define'
    ;

Booleanliteral
   : False
   | True
   ;

QuotedCharacterSequence
    :   '"' .*?  '"'
    ;

ArraySequence
    :   '{' .*?  '}'
    |   '[' .*?  ']'
    ;

IntNumber
    :   Digit+
    ;

DoubleNumber
    :   Digit+ POINT Digit+
    |   ZERO POINT Digit+
    ;

CharacterSequence
    :   Text+
    ;

Identifier
    :   [a-zA-Z_0-9]+
    ;

BlockComment
    : '/**' .*? '*/'
    ;

LineComment
    :   LINE_COMMENT_SIGN ~[\r\n]*
    ;

EmptyLineComment
    :   LINE_COMMENT_SIGN -> skip
    ;

Newline
    :   (   '\r' '\n'?
        |   '\n'
        )
        -> skip
    ;

WhiteSpace
   : [ \r\n\t]+ -> skip;

fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';

fragment Digit
    :   [0-9]
    ;

fragment Text
    :   [a-zA-Z0-9.]
    ;


fragment False
   : 'false'
   ;

fragment True
   : 'true'
   ;

Parser:

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit: statement* EOF;

statement
    : comment? pragmaDirective
    | comment? defineDirective
    | section
    | comment
    ;

pragmaDirective
    :   Pragma CharacterSequence
    ;

defineDirective
    :   Define Identifier Booleanliteral LineComment?
    |   Define Identifier DoubleNumber LineComment?
    |   Define Identifier IntNumber LineComment?
    |   Define Identifier CharacterSequence LineComment?
    |   Define Identifier QuotedCharacterSequence LineComment?
    |   Define Identifier ArraySequence LineComment?
    |   Define Identifier
    ;

section: SectionLineComment;

comment
    : BlockComment
    | LineComment+
    ;

Text to parse:

/**
 * BLOCK COMMENT
 */
#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//

// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

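With this configuration I have a few issues:

  1. The parser does not correctly parse "#define IDENTIFIER abcd" on line 11
  2. "// @section test" on line 36 is parsed as a line comment, but I need it parsed as a separate token
  3. Parsing of a commented-out define directive does not work: "//#define IDENTIFIER_3 Version.h // Line 5"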

Recommended answer

Whenever there are problems when parsing, you should check what kind of tokens the lexer is producing.

Here are the tokens that your lexer produces:

BlockComment              `/**\n * BLOCK COMMENT\n */`
Pragma                    `#pragma`
CharacterSequence         `once`
BlockComment              `/**\n * BLOCK COMMENT\n */`
Define                    `#define`
Identifier                `CONFIGURATION_H_VERSION`
IntNumber                 `12345`
Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd.dd`
Define                    `#define`
Identifier                `IDENTIFIER_2`
Booleanliteral            `true`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20`
ArraySequence             `{ONE, TWO}`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20_30`
ArraySequence             `{ 1, 2, 3, 4 }`
Define                    `#define`
Identifier                `IDENTIFIER_20_30_A`
ArraySequence             `[ 1, 2, 3, 4 ]`
Define                    `#define`
Identifier                `DEFAULT_A`
DoubleNumber              `10.0`
LineComment               `//================================================================`
LineComment               `//============================= INFO =============================`
LineComment               `//================================================================`
BlockComment              `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment               `//==================================================================`
LineComment               `//============================= INFO ===============================`
LineComment               `//==================================================================`
LineComment               `// Line 1`
LineComment               `// Line 2`
LineComment               `//`
LineComment               `// @section test`
LineComment               `// Line 3`
Define                    `#define`
Identifier                `IDENTIFIER_TWO`
QuotedCharacterSequence   `"(ONE, TWO, THREE)"`
LineComment               `// Line 4`
LineComment               `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment               `// Line 6`
Define                    `#define`
Identifier                `IDENTIFIER_THREE`

As you can see in the list above, #define IDENTIFIER abcd is not being parsed properly because it produces the following tokens:

Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`

and can therefore not match the parser rule:

defineDirective
    :   ...
    |   Define Identifier CharacterSequence LineComment?
    |   ...
    ;

As you can see, the lexer operates independently from the parser. No matter whether the parser is trying to match an Identifier for the text "IDENTIFIER", the lexer will always produce a CharacterSequence token for it.

The lexer creates tokens based on only 2 rules:

  1. try to match as many characters as possible
  2. if 2 (or more) lexer rules can match the same characters, the rule defined first "wins"

Because of the rules mentioned above, //#define IDENTIFIER_3 Version.h // Line 5 is tokenised as a LineComment (rule 1 applies: match as much as possible). And input like once is tokenised as a CharacterSequence and not as an Identifier (rule 2 applies: CharacterSequence is defined before Identifier).
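The two rules can be tried out without ANTLR. The following is only a minimal Python sketch: the regular expressions are simplified stand-ins for four rules of the lexer in the question (an assumption made for illustration), and `next_token` and `tokenize` are hypothetical helper names, not part of any ANTLR API.

```python
import re

# Simplified stand-ins for four rules of the question's lexer, listed in
# their definition order. The patterns are assumptions for illustration.
RULES = [
    ("IntNumber", re.compile(r"[0-9]+")),
    ("DoubleNumber", re.compile(r"[0-9]+\.[0-9]+")),
    ("CharacterSequence", re.compile(r"[a-zA-Z0-9.]+")),
    ("Identifier", re.compile(r"[a-zA-Z_0-9]+")),
]


def next_token(text, pos):
    """Return (rule_name, lexeme) for the winning rule at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Rule 1: a strictly longer match replaces the current best.
        # Rule 2: on a tie the earlier rule is kept, because only '>' is used.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


def tokenize(text):
    """Tokenise text, skipping whitespace between tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens


for line in ("IDENTIFIER abcd", "IDENTIFIER_1 abcd.dd", "12345"):
    print(line, "->", tokenize(line))
```

With these stand-in patterns, IDENTIFIER ties between CharacterSequence and Identifier and goes to the rule defined first, while IDENTIFIER_1 goes to Identifier because only that rule can consume the underscore and therefore matches more characters.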

To have #define be treated the same inside and outside a comment, you could use lexical modes. Whenever the lexer sees a //, it goes into a special comment mode, and once in this comment mode, it will also recognise #define and @section tokens. It leaves this mode when it sees one of these tokens (or when it sees a line break, of course).

A quick demo of how that could look:

lexer grammar HeaderLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
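NUMBER          : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}' | '[' .*?  ']';
// Catch-all rule: kept last so it never wins a tie against
// a single-character match such as a one-digit NUMBER
OTHER           : .;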

mode COMMENT_MODE;

  // If we match one of the following 3 rules, leave this comment mode
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;

  // If none of the 3 rules above matched, consume a single
  // character (which is part of the comment)
  COMMENT_MODE_PART       : ~[\r\n];

and a parser could then look like this:

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

defineDirective
 : DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
 | DEFINE IDENTIFIER NUMBER line_comment?
 | DEFINE IDENTIFIER char_sequence line_comment?
 | DEFINE IDENTIFIER STRING line_comment?
 | DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
 | DEFINE IDENTIFIER
 ;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;
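To see what token stream such a two-mode lexer would hand to the parser, the push/pop behaviour can be imitated in plain Python. This is only a rough sketch under stated assumptions: the rule names mirror the demo grammar, but the regular expressions, the reduced rule set and the `tokenize` helper are hypothetical and are not generated by ANTLR.

```python
import re

# Hypothetical, cut-down imitation of the two-mode lexer. Rule names follow
# the demo grammar; patterns and the loop below are assumptions.
DEFAULT_RULES = [
    ("SPACES", re.compile(r"[ \r\n\t]+")),
    ("COMMENT_START", re.compile(r"//")),
    ("DEFINE", re.compile(r"#define")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("CHAR_SEQUENCE", re.compile(r"[a-zA-Z_][a-zA-Z_0-9.]*")),
]
COMMENT_RULES = [
    ("DEFINE", re.compile(r"#define")),
    ("SECTION", re.compile(r"@section")),
    ("LINE_BREAK", re.compile(r"[\r\n]+")),
    ("COMMENT_MODE_PART", re.compile(r"[^\r\n]")),
]


def tokenize(text):
    """Return (name, lexeme) pairs, switching rule sets via a mode stack."""
    modes = ["DEFAULT"]  # plays the role of pushMode / popMode
    tokens, pos = [], 0
    while pos < len(text):
        in_comment = modes[-1] == "COMMENT"
        rules = COMMENT_RULES if in_comment else DEFAULT_RULES
        best = None
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, m = best
        pos = m.end()
        if name == "COMMENT_START":
            modes.append("COMMENT")   # enter comment mode
        elif in_comment and name != "COMMENT_MODE_PART":
            modes.pop()               # DEFINE, SECTION or line break
        if name not in ("SPACES", "LINE_BREAK"):
            tokens.append((name, m.group()))
    return tokens


for name, lexeme in tokenize("//#define IDENTIFIER_3 Version.h // Line 5\n"):
    print(name, repr(lexeme))
```

In this sketch the commented-out directive yields COMMENT_START followed by a regular DEFINE token, so the same defineDirective rule can apply to it, and the trailing // Line 5 becomes another COMMENT_START followed by one COMMENT_MODE_PART per character.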
