ANTLR4 Parser issues

Published: 2021/11/11 4:09:04 · Tags: parsing, antlr4

Question

I'm trying to write a parser for a C++-style header file, but I can't get the parser configured properly.

Lexer:

lexer grammar HeaderLexer;

SectionLineComment
    :   LINE_COMMENT_SIGN Section CharacterSequence
    ;

Pragma
    : POUND 'pragma'
    ;

Section
    :  AT_SIGN 'section'
    ;

Define
    : POUND 'define'
    | LINE_COMMENT_SIGN POUND 'define'
    ;

Booleanliteral
   : False
   | True
   ;

QuotedCharacterSequence
    :   '"' .*?  '"'
    ;

ArraySequence
    :   '{' .*?  '}'
    |   '[' .*?  ']'
    ;

IntNumber
    :   Digit+
    ;

DoubleNumber
    :   Digit+ POINT Digit+
    |   ZERO POINT Digit+
    ;

CharacterSequence
    :   Text+
    ;

Identifier
    :   [a-zA-Z_0-9]+
    ;

BlockComment
    : '/**' .*? '*/'
    ;

LineComment
    :   LINE_COMMENT_SIGN ~[\r\n]*
    ;

EmptyLineComment
    :   LINE_COMMENT_SIGN -> skip
    ;

Newline
    :   (   '\r' '\n'?
        |   '\n'
        )
        -> skip
    ;

WhiteSpace
   : [ \r\n\t]+ -> skip;

fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';

fragment Digit
    :   [0-9]
    ;

fragment Text
    :   [a-zA-Z0-9.]
    ;


fragment False
   : 'false'
   ;

fragment True
   : 'true'
   ;

Parser:

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit: statement* EOF;

statement
    : comment? pragmaDirective
    | comment? defineDirective
    | section
    | comment
    ;

pragmaDirective
    :   Pragma CharacterSequence
    ;

defineDirective
    :   Define Identifier Booleanliteral LineComment?
    |   Define Identifier DoubleNumber LineComment?
    |   Define Identifier IntNumber LineComment?
    |   Define Identifier CharacterSequence LineComment?
    |   Define Identifier QuotedCharacterSequence LineComment?
    |   Define Identifier ArraySequence LineComment?
    |   Define Identifier
    ;

section: SectionLineComment;

comment
    : BlockComment
    | LineComment+
    ;

Text to parse:

/**
 * BLOCK COMMENT
 */
#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//

// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

With this configuration I have a few issues:

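1. The parser does not correctly parse "#define IDENTIFIER abcd" (line 11 of the input).
2. "// @section test" (line 36 of the input) is parsed as a line comment, but I need it to be parsed as a separate token.
3. Parsing a commented-out define directive does not work: "//#define IDENTIFIER_3 Version.h // Line 5".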

Recommended answer

Whenever parsing goes wrong, the first thing to check is which tokens the lexer is producing.

Here are the tokens that your lexer produces:

BlockComment              `/**\n * BLOCK COMMENT\n */`
Pragma                    `#pragma`
CharacterSequence         `once`
BlockComment              `/**\n * BLOCK COMMENT\n */`
Define                    `#define`
Identifier                `CONFIGURATION_H_VERSION`
IntNumber                 `12345`
Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd.dd`
Define                    `#define`
Identifier                `IDENTIFIER_2`
Booleanliteral            `true`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20`
ArraySequence             `{ONE, TWO}`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20_30`
ArraySequence             `{ 1, 2, 3, 4 }`
Define                    `#define`
Identifier                `IDENTIFIER_20_30_A`
ArraySequence             `[ 1, 2, 3, 4 ]`
Define                    `#define`
Identifier                `DEFAULT_A`
DoubleNumber              `10.0`
LineComment               `//================================================================`
LineComment               `//============================= INFO =============================`
LineComment               `//================================================================`
BlockComment              `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment               `//==================================================================`
LineComment               `//============================= INFO ===============================`
LineComment               `//==================================================================`
LineComment               `// Line 1`
LineComment               `// Line 2`
LineComment               `//`
LineComment               `// @section test`
LineComment               `// Line 3`
Define                    `#define`
Identifier                `IDENTIFIER_TWO`
QuotedCharacterSequence   `"(ONE, TWO, THREE)"`
LineComment               `// Line 4`
LineComment               `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment               `// Line 6`
Define                    `#define`
Identifier                `IDENTIFIER_THREE`

As you can see in the list above, #define IDENTIFIER abcd is not parsed properly because it produces the following tokens:

Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`

and can therefore not match the parser rule:

defineDirective
    :   ...
    |   Define Identifier CharacterSequence LineComment?
    |   ...
    ;

As you can see, the lexer operates independently of the parser. Even if the parser is trying to match an Identifier for the text "IDENTIFIER", the lexer will always produce a CharacterSequence token for it.

The lexer creates tokens based on only 2 rules:

1. Try to match as many characters as possible.
2. If 2 (or more) lexer rules can match the same characters, the rule defined first wins.

Because of the rules mentioned above, //#define IDENTIFIER_3 Version.h // Line 5 is tokenised as a LineComment (rule 1 applies: match as much as possible). And input like once is tokenised as a CharacterSequence and not as an Identifier (rule 2 applies: CharacterSequence is defined before Identifier).
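The two rules can be tried out with a small Python sketch. This is not ANTLR and not the generated lexer: it is a hand-written approximation in which the `RULES` table holds regular-expression stand-ins for a few of the question's lexer rules (in the same order as in the grammar), and `tokenize` is a hypothetical helper that applies "longest match wins, earliest rule wins on a tie".

```python
import re

# Regex stand-ins for some of the question's lexer rules, in grammar order.
# These are approximations for illustration, not the real ANTLR rules.
RULES = [
    ("Define", re.compile(r"#define")),
    ("IntNumber", re.compile(r"[0-9]+")),
    ("DoubleNumber", re.compile(r"[0-9]+\.[0-9]+")),
    ("CharacterSequence", re.compile(r"[a-zA-Z0-9.]+")),
    ("Identifier", re.compile(r"[a-zA-Z_0-9]+")),
    ("LineComment", re.compile(r"//[^\r\n]*")),
]


def tokenize(text):
    """Return (rule name, matched text) pairs for the given input."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped, as in the grammar
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Rule 1: a longer match replaces the current best.
            # Rule 2: on a tie the strict '>' keeps the earlier rule.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches {text[pos]!r} at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for line in ("#define IDENTIFIER abcd", "#define IDENTIFIER_1 abcd"):
    print(tokenize(line))
```

In this sketch IDENTIFIER becomes a CharacterSequence (a tie, so the earlier rule wins), while IDENTIFIER_1 becomes an Identifier: the underscore is not part of Text, so only Identifier can match the whole word and rule 1 decides.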

To have #define treated the same inside and outside a comment, you could use lexical modes. Whenever the lexer sees a //, it goes into a special comment mode, and once in this comment mode, it also recognises #define and @section tokens. You leave this mode when seeing one of these tokens (or when you see a line break, of course).

A quick demo of how that could look:

lexer grammar HeaderLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}' | '[' .*?  ']';
// Catch-all rule: keep it last so it never beats a more specific rule on a 1-character tie
OTHER           : .;

mode COMMENT_MODE;

  // If we match one of the following 3 rules, leave this comment mode
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;

  // If none of the 3 rules above matched, consume a single
  // character (which is part of the comment)
  COMMENT_MODE_PART       : ~[\r\n];

and a parser could then look like this:

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

defineDirective
 : DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
 | DEFINE IDENTIFIER NUMBER line_comment?
 | DEFINE IDENTIFIER char_sequence line_comment?
 | DEFINE IDENTIFIER STRING line_comment?
 | DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
 | DEFINE IDENTIFIER
 ;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;
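To see the mode switching in isolation, here is a rough Python simulation of the pushMode/popMode behaviour used in the lexer grammar above. It is a sketch under simplifying assumptions, not ANTLR-generated code: `tokenize_with_modes` is a hypothetical helper, the `WORD` token is a stand-in for IDENTIFIER / CHAR_SEQUENCE, and only the rules needed for comment lines are modelled.

```python
import re

WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9.]*")  # stand-in for IDENTIFIER / CHAR_SEQUENCE
LINE_BREAKS = re.compile(r"[\r\n]+")


def tokenize_with_modes(text):
    """Return (type, text) pairs, switching modes the way the grammar does."""
    tokens = []
    modes = ["DEFAULT"]  # mode stack; the last entry is the active mode
    pos = 0
    while pos < len(text):
        if modes[-1] == "COMMENT_MODE":
            line_break = LINE_BREAKS.match(text, pos)
            if text.startswith("#define", pos):
                tokens.append(("DEFINE", "#define"))  # like type(DEFINE), popMode
                modes.pop()
                pos += len("#define")
            elif text.startswith("@section", pos):
                tokens.append(("SECTION", "@section"))  # like type(SECTION), popMode
                modes.pop()
                pos += len("@section")
            elif line_break:
                modes.pop()  # like skip, popMode: no token is emitted
                pos = line_break.end()
            else:
                tokens.append(("COMMENT_MODE_PART", text[pos]))
                pos += 1
        elif text.startswith("//", pos):
            tokens.append(("COMMENT_START", "//"))
            modes.append("COMMENT_MODE")  # like pushMode(COMMENT_MODE)
            pos += 2
        elif text.startswith("#define", pos):
            tokens.append(("DEFINE", "#define"))
            pos += len("#define")
        elif text[pos].isspace():
            pos += 1
        else:
            word = WORD.match(text, pos)
            if word is None:
                raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
            tokens.append(("WORD", word.group()))
            pos = word.end()
    return tokens


print(tokenize_with_modes("// @section test\n"))
```

For the section line, the simulation emits COMMENT_START, one COMMENT_MODE_PART for the space, SECTION, and then a word token from the default mode, which is the token sequence the sectionLineComment parser rule expects.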
