ANTLR island grammars and a non-greedy rule that consumes too much


Problem description


I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".

Desired outcome:


My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions. My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.

Setup and what I've tried:


The header file I'm attempting to lex and parse is very uniform and consistent. Every function declaration is preceded by a linkage macro PK_linkage_m and all functions return the same type PK_ERROR_code_t, ex:

PK_linkage_m PK_ERROR_code_t PK_function(...);


These tokens don't appear anywhere other than at the start of a function declaration.


I have approached this as an island grammar, that is, function declarations in a sea of text. I have tried to use the linkage token PK_linkage_m to indicate the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.

Observed problem:


While lexing and parsing a single function declaration works, it fails when I have more than one function declaration in a file. The token stream shows that "everything + all function declarations + PK_ERROR_code_t of the last function declaration" is consumed as text, and then only the last function declaration in the file is correctly parsed.


My one line summary is: My non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.

What I think is the solution (possibly incorrectly):


Fix my lexer non-greedy rule somehow so that it consumes everything until it finds the PK_linkage_m token. My non-greedy rule appears to consume too much.

Things I haven't tried:


As this is my first ANTLR project, and my first language parsing project in a very long time, I'd be more than happy to rewrite it if I'm wrong and getting wronger. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, and I'm not sure how it's fundamentally different.

Here is my lexer file KernelLexer.g4:

lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant

@lexer::members {
    public static final int WHITESPACE = 1;
}

PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';

//Doesn't work: once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;

TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;

mode FUNCTION;

//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';

COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;

ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';

WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);
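The ordering comment above ("constants must go above ID") relies on how an ANTLR lexer disambiguates: the longest match wins, and when two rules match the same length of input, the rule listed first in the grammar wins. Here is a minimal Python sketch of that tie-breaking (illustrative only, not ANTLR's real machinery):

```python
import re

# Each (name, pattern) pair mirrors a lexer rule, in grammar order.
rules = [('CONST', r'const'), ('ID', r'[A-Za-z_][A-Za-z_0-9]*')]

def first_token(text):
    """Return the token at the start of text: longest match wins,
    ties go to the rule listed first (strict '>' keeps the earlier rule)."""
    best = None
    for name, pat in rules:
        m = re.match(pat, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

Note that `first_token('constant,')` yields an ID, not CONST, because ID's match is longer; the rule order only decides exact ties such as the literal input `const`.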

Here is my parser file KernelParser.g4:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file : func_decl+;

func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;

param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;

Here is a simple example input file:

/*some stuff*/

other stuff;

PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t         /*class*/,             /* a class */
/* returned */
PK_CLASS_t *const  /*superclass*/         /* immediate superclass of class */
);

/*some stuff*/
blar blar;


PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t           /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t           /*class*/,           /* a class */
/* returned */
PK_LOGICAL_t *const  /*is_subclass*/      /* whether it was a subclass */
);


more stuff;
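As a sanity check on what the "islands" should be (this is just an illustration with a plain regex, not the ANTLR solution), every declaration in the sample is introduced by the `PK_linkage_m PK_ERROR_code_t` marker pair, so the expected function names can be pulled out directly:

```python
import re

# Abbreviated copy of the sample header above.
header = """
/*some stuff*/
other stuff;
PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
PK_CLASS_t         /*class*/              /* a class */
);
blar blar;
PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
PK_CLASS_t           /*may_be_subclass*/  /* a potential subclass */
);
more stuff;
"""

# Each function name follows the PK_linkage_m PK_ERROR_code_t prefix.
names = re.findall(r'PK_linkage_m\s+PK_ERROR_code_t\s+(\w+)', header)
```

A correct lexer/parser pair should therefore produce exactly two `func_decl` subtrees for this input.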

Here is the token output:

line 28:0 token recognition error at: 'more stuff;\r\n'
[@0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[@1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[@2,350:350='(',<'('>,19:0]
[@3,369:378='PK_CLASS_t',<ID>,21:0]
[@4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[@5,409:409=',',<','>,21:40]
[@6,439:448='PK_CLASS_t',<ID>,22:0]
[@7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[@8,469:469=',',<','>,22:30]
[@9,512:523='PK_LOGICAL_t',<ID>,24:0]
[@10,525:525='*',<'*'>,24:13]
[@11,526:530='const',<'const'>,24:14]
[@12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[@13,587:588=');',<');'>,25:0]
[@14,608:607='<EOF>',<EOF>,29:0]

Answer


Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:

  • a PK_ERROR in the default mode (and switch to another mode like you're now doing),
  • or else match a single character and skip it?

Something like this:

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : . -> skip;

mode FUNCTION;

// the rest of your rules as you have them now
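The effect of these two default-mode rules can be sketched in Python (illustrative only, not ANTLR's actual algorithm): at each position the lexer either matches the keyword and switches mode, or skips exactly one character and tries again, so nothing is ever over-consumed.

```python
def find_function_starts(text, keyword='PK_ERROR_code_t'):
    """Return the start offset of every keyword occurrence, scanning
    one character at a time -- the analogue of OTHER : . -> skip;"""
    starts = []
    i = 0
    while i < len(text):
        if text.startswith(keyword, i):
            starts.append(i)        # PK_ERROR : ... -> mode(FUNCTION);
            i += len(keyword)
        else:
            i += 1                  # OTHER : . -> skip;
    return starts
```

Because the scan never commits to more than one character at a time outside the keyword, a second or third declaration in the file is found just like the first.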


Note that this will match PK_ERROR_code_t as well for the input "PK_ERROR_code_t_MU ...", so this would be a safer way:

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;

mode FUNCTION;

// the rest of your rules as you have them now
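The boundary problem can also be sketched in Python (again, an illustration, not ANTLR itself): skipping one character at a time would let the keyword match inside a longer identifier such as `PK_ERROR_code_t_MU`, while skipping whole identifiers first, as the revised `OTHER` rule does, prevents that.

```python
import re

def keyword_hits(text, keyword='PK_ERROR_code_t'):
    """Find keyword occurrences while skipping whole identifiers,
    mirroring OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;"""
    hits = []
    for m in re.finditer(r'[A-Za-z_][A-Za-z_0-9]*|.', text, re.S):
        if m.group() == keyword:    # only exact whole-identifier matches count
            hits.append(m.start())
    return hits
```

With identifiers consumed as units, `PK_ERROR_code_t_MU` is lexed as a single (skipped) token and never mistaken for the keyword.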


Your parser grammar could then look like this:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file        : func_decl+ EOF;
func_decl   : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl  : type_decl COMMENTED_NAME COMMA?;
type_decl   : CONST? STAR* ID STAR* CONST?;


causing your example input to be parsed into a file tree containing one func_decl subtree for each of the two declarations.
