ANTLR island grammars and a non-greedy rule that consumes too much


Problem description


I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".

Desired result:


My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions. My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.

Setup and what I've tried:


The header file I'm attempting to lex and parse is very uniform and consistent. Every function declaration is preceded by a linkage macro PK_linkage_m and all functions return the same type PK_ERROR_code_t, ex:

PK_linkage_m PK_ERROR_code_t PK_function(...);


These tokens don't appear anywhere other than at the start of a function declaration.


I have approached this as an island grammar, that is, function declarations in a sea of text. I have tried to use the linkage token PK_linkage_m to indicate the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.

Observed problem:


While lexing and parsing a single function declaration works, it fails when I have more than one function declaration in a file. The token stream shows that "everything + all function declarations + the PK_ERROR_code_t of the last function declaration" is consumed as text, and then only the last function declaration in the file is correctly parsed.


My one line summary is: My non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.

What I think might be the solution (though I may be wrong):


Fix my lexer non-greedy rule somehow so that it consumes everything until it finds the PK_linkage_m token. My non-greedy rule appears to consume too much.

What I haven't tried yet:


As this is my first ANTLR project, and my first language parsing project in a very long time, I'd be more than happy to rewrite it if I'm wrong and getting wronger. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, and not sure how it's fundamentally different.

Here is my lexer file KernelLexer.g4:

lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant

@lexer::members {
    public static final int WHITESPACE = 1;
}

PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';

//Doesn't work. Once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;

TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;

mode FUNCTION;

//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';

COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;

ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';

WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);

Here is my parser file KernelParser.g4:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file : func_decl+;

func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;

param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;

Here is a simple example input file:

/*some stuff*/

other stuff;

PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t         /*class*/,             /* a class */
/* returned */
PK_CLASS_t *const  /*superclass*/         /* immediate superclass of class */
);

/*some stuff*/
blar blar;


PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t           /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t           /*class*/,           /* a class */
/* returned */
PK_LOGICAL_t *const  /*is_subclass*/      /* whether it was a subclass */
);


more stuff;

Here is the token output:

line 28:0 token recognition error at: 'more stuff;\r\n'
[@0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[@1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[@2,350:350='(',<'('>,19:0]
[@3,369:378='PK_CLASS_t',<ID>,21:0]
[@4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[@5,409:409=',',<','>,21:40]
[@6,439:448='PK_CLASS_t',<ID>,22:0]
[@7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[@8,469:469=',',<','>,22:30]
[@9,512:523='PK_LOGICAL_t',<ID>,24:0]
[@10,525:525='*',<'*'>,24:13]
[@11,526:530='const',<'const'>,24:14]
[@12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[@13,587:588=');',<');'>,25:0]
[@14,608:607='<EOF>',<EOF>,29:0]

Answer


Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:

  • a PK_ERROR in the default mode (and switch to another mode like you're now doing),
  • or else match a single character and skip it?

Like this:

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : . -> skip;

mode FUNCTION;

// the rest of your rules as you have them now
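The "match the keyword or else skip a single character" strategy can be sketched outside ANTLR. Below is a hand-rolled Python illustration (the `declarations` helper and the sample input are invented for this example, not generated ANTLR output): in the default mode we either recognize `PK_ERROR_code_t` and switch to a FUNCTION-like mode that runs until `);`, or we skip one character, so no rule can ever over-consume the sea of text.

```python
# Hand-rolled sketch of the lexer strategy above (illustration only):
#   PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
#   OTHER    : . -> skip;
KEYWORD = "PK_ERROR_code_t"

def declarations(text):
    """Collect every 'PK_ERROR_code_t ... );' span from a sea of text."""
    decls, pos = [], 0
    while pos < len(text):
        if text.startswith(KEYWORD, pos):
            end = text.index(");", pos)   # FUNCTION mode runs until ');'
            decls.append(text[pos:end + 2])
            pos = end + 2                 # back to DEFAULT_MODE
        else:
            pos += 1                      # OTHER : . -> skip
    return decls

sea = ("junk; PK_linkage_m PK_ERROR_code_t PK_f( int /*a*/ ); "
       "more junk; PK_linkage_m PK_ERROR_code_t PK_g( void /*v*/ );")
print(declarations(sea))
# -> ['PK_ERROR_code_t PK_f( int /*a*/ );', 'PK_ERROR_code_t PK_g( void /*v*/ );']
```

Because no rule starts with `.*?`, every declaration in the file is found, not just the last one.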


Note that this will match PK_ERROR_code_t as well for the input "PK_ERROR_code_t_MU ...", so this would be a safer way:

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;

mode FUNCTION;

// the rest of your rules as you have them now
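To see why the identifier alternative in OTHER helps, here is a simplified Python model of ANTLR's rule selection (longest match wins; on a tie, the rule defined first wins). The rule tables below are invented for the demonstration and only approximate ANTLR's real behavior:

```python
import re

def next_token(rules, text, pos=0):
    """Pick the winning lexer rule at pos: longest match wins,
    ties go to the earlier rule (as in ANTLR)."""
    name, end = None, pos
    for rule_name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > end:   # strictly longer wins; ties keep the earlier rule
            name, end = rule_name, m.end()
    return name, text[pos:end]

unsafe = [("PK_ERROR", r"PK_ERROR_code_t"), ("OTHER", r".")]
safe   = [("PK_ERROR", r"PK_ERROR_code_t"), ("OTHER", r"[a-zA-Z_][a-zA-Z_0-9]*|.")]

print(next_token(unsafe, "PK_ERROR_code_t_MU x"))  # keyword wrongly wins over '.'
print(next_token(safe,   "PK_ERROR_code_t_MU x"))  # whole identifier skipped
print(next_token(safe,   "PK_ERROR_code_t x"))     # tie: earlier rule (keyword) wins
```

With only `.` in OTHER, the 15-character keyword beats the 1-character wildcard inside `PK_ERROR_code_t_MU`; once OTHER can match a whole identifier, its 18-character match wins and the word is skipped, while a genuine `PK_ERROR_code_t` still ties at 15 characters and goes to the earlier rule.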


Your parser grammar could then look like this:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file        : func_decl+ EOF;
func_decl   : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl  : type_decl COMMENTED_NAME COMMA?;
type_decl   : CONST? STAR* ID STAR* CONST?;


causing your example input to be parsed like this:
