Lexer to handle lines with line number prefix
Question
I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are the standard [a-zA-Z_][a-zA-Z0-9_]*, and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and absent in most cases).
My current grammar looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*;
However, this results in every line that begins with a LINE_NUM token being tokenized as a single IDENTIFIER.
Is there a way to properly tokenize this input using an ANTLR grammar?
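For context, the behaviour described above follows from the ANTLR lexer preferring the rule that matches the most input (with the earlier rule winning ties). The following plain-Java sketch does not use ANTLR; it only compares how much of a sample line each pattern from the grammar can consume. The class and method names (`LongestMatchSketch`, `prefixLength`) are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // Length of the prefix of `text` matched by `regex`,
    // or 0 if the pattern does not match at the start.
    public static int prefixLength(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        String line = "L10foo";
        // LINE_NUM pattern: 'L' followed by digits, stops before "foo".
        System.out.println(prefixLength("L[0-9]+", line));                // 3
        // IDENTIFIER pattern: consumes the whole line, so it is the longer match.
        System.out.println(prefixLength("[a-zA-Z_][a-zA-Z0-9_]*", line)); // 6
    }
}
```

Because the IDENTIFIER pattern covers all six characters and the LINE_NUM pattern only three, a longest-match lexer emits one IDENTIFIER for the whole line.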
Recommended answer
You need to add a semantic predicate to IDENTIFIER:
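// Java target. IDENTIFIER is disabled only at column 0 when the input
// starts with 'L' followed by a digit, so LINE_NUM can match there.
// getCharPositionInLine() is a method of the lexer itself, not of _input.
IDENTIFIER
    : {getCharPositionInLine() != 0
        || _input.LA(1) != 'L'
        || !Character.isDigit(_input.LA(2))}?
      [a-zA-Z_] [a-zA-Z0-9_]*
    ;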
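To see what the predicate decides, here is a minimal plain-Java sketch of the same three-part test, independent of ANTLR. The names `PredicateSketch` and `identifierAllowed` are made up for illustration; `column` stands in for the lexer's char position in line, and the two `charAt` lookups stand in for `_input.LA(1)` and `_input.LA(2)`.

```java
public class PredicateSketch {
    // Hypothetical helper mirroring the predicate on IDENTIFIER:
    // returns true when IDENTIFIER may start at `column` of `line`.
    public static boolean identifierAllowed(String line, int column) {
        boolean atLineStart = column == 0;
        boolean startsWithL = column < line.length() && line.charAt(column) == 'L';
        boolean digitFollows = column + 1 < line.length()
                && Character.isDigit(line.charAt(column + 1));
        // IDENTIFIER is blocked only when all three conditions hold.
        return !(atLineStart && startsWithL && digitFollows);
    }

    public static void main(String[] args) {
        System.out.println(identifierAllowed("L10foo", 0)); // false: looks like a line number
        System.out.println(identifierAllowed("L10foo", 3)); // true: not at the start of the line
        System.out.println(identifierAllowed("Label", 0));  // true: 'L' is not followed by a digit
        System.out.println(identifierAllowed("foo", 0));    // true: does not start with 'L'
    }
}
```

With IDENTIFIER ruled out at column 0 of `L10foo`, LINE_NUM is the only rule left that can match there, so it consumes `L10`; at column 3 the predicate passes and `foo` is tokenized as IDENTIFIER.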
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;
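The effect of the two modes can be sketched in plain Java without ANTLR. This is only a simulation under stated assumptions: the class `ModeLexerSketch`, its `Rule` type, and the `tokenize` method are hypothetical, each rule is approximated by a `java.util.regex` pattern, and the loop picks the longest match with the earlier rule winning ties. As in the grammar above, whitespace is not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModeLexerSketch {
    static final int NONE = 0, PUSH = 1, POP = 2;

    // One simulated lexer rule: token type, pattern, and mode action after a match.
    static final class Rule {
        final String type;
        final Pattern pattern;
        final int action;
        Rule(String type, String regex, int action) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
            this.action = action;
        }
    }

    // Default mode: active at the beginning of a line.
    static final List<Rule> DEFAULT_MODE = List.of(
        new Rule("LINE_NUM", "L[0-9]+", PUSH),
        new Rule("KEYWORD_A", "someKeyword", PUSH),
        new Rule("KEYWORD_B", "reservedWord", PUSH),
        // An identifier here must not look like 'L' followed by a digit.
        new Rule("IDENTIFIER", "L(?:[a-zA-Z_][a-zA-Z0-9_]*)?|[a-zA-KM-Z_][a-zA-Z0-9_]*", PUSH),
        new Rule("NL", "\\r\\n?|\\n", NONE));

    // Second mode: any identifier is allowed; a newline returns to the default mode.
    static final List<Rule> NOT_BEGINNING_OF_LINE = List.of(
        new Rule("NL", "\\r\\n?|\\n", POP),
        new Rule("KEYWORD_A", "someKeyword", NONE),
        new Rule("KEYWORD_B", "reservedWord", NONE),
        new Rule("IDENTIFIER", "[a-zA-Z_][a-zA-Z0-9_]*", NONE));

    public static List<String> tokenize(String input) {
        Deque<List<Rule>> modes = new ArrayDeque<>();
        modes.push(DEFAULT_MODE);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule rule : modes.peek()) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            String text = input.substring(pos, bestEnd);
            tokens.add(best.type.equals("NL") ? "NL" : best.type + "(" + text + ")");
            if (best.action == PUSH) modes.push(NOT_BEGINNING_OF_LINE);
            else if (best.action == POP) modes.pop();
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L10someKeyword\nL250foo\nLabel\n"));
        // [LINE_NUM(L10), KEYWORD_A(someKeyword), NL, LINE_NUM(L250), IDENTIFIER(foo), NL, IDENTIFIER(Label), NL]
    }
}
```

In the default mode the restricted IDENTIFIER pattern can consume at most the single `L` of `L10...`, so LINE_NUM is the longer match; once any token has been seen on the line, the second mode accepts ordinary identifiers until the newline pops back.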