Ordering lexer rules in a grammar using ANTLR4
I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because, as in the example sentence above, a comma can appear right after a "func". Since a comma can also follow a "WORD", I decided to put the punctuation in "item" rather than in both "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
and then run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference:
ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.
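The quoted rule is a tie-breaker. The ANTLR lexer first prefers the rule that matches the longest run of input, and only when two rules match the same length does the rule listed first win. The following is a minimal Python sketch of that behaviour, not ANTLR itself. The regular expressions are hand-translated from the lexer rules in the question. The names `lexer_rules` and `tokenize` are hypothetical helpers, and placing the literal tokens '<<' and '>>' ahead of the named rules is an assumption of this sketch.

```python
import re

# Hand-translated from the question's fragments (an approximation, not ANTLR output).
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
SYMB = r"[^a-zA-Z0-9.,?! |{}<>]"
WORD = ("WORD", rf"(?:{LETTER}|{DIGIT}|{SYMB})+")
ID = ("ID", rf"{LETTER}(?:{LETTER}|{DIGIT})*")


def lexer_rules(id_first):
    """Return the rule list; list order models rule order in the grammar file."""
    pair = [ID, WORD] if id_first else [WORD, ID]
    return [("'<<'", r"<<"), ("'>>'", r">>"), ("WS", r"[ \t\n\r]")] + pair + [
        ("PUNCT", r"[.,?!]")
    ]


def tokenize(text, rules):
    """Longest match wins; on equal length the earlier rule is kept."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so ties go to the rule that appears first in the list.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


text = "Hello << name >>, how are you at nine o'clock?"
word_first = tokenize(text, lexer_rules(id_first=False))
id_first = tokenize(text, lexer_rules(id_first=True))
print(word_first)
print(id_first)
```

With WORD listed first, every plain word ties between WORD and ID and is typed WORD, which is why `func` never sees an ID. With ID listed first, plain words become ID, while `o'clock` is still WORD because WORD matches seven characters and ID only one.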
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[@0,0:4='Hello',<WORD>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<WORD>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<WORD>,1:18]
[@6,22:24='are',<WORD>,1:22]
[@7,26:28='you',<WORD>,1:26]
[@8,30:31='at',<WORD>,1:30]
[@9,33:36='nine',<WORD>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule:
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[@0,0:4='Hello',<ID>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<ID>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<ID>,1:18]
[@6,22:24='are',<ID>,1:22]
[@7,26:28='you',<ID>,1:26]
[@8,30:31='at',<ID>,1:30]
[@9,33:36='nine',<ID>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
and there are no more errors. As the -gui graph shows, the parse tree now has branches identified as word or func.
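To see why the extra word rule removes the errors, here is a small hand-written recursive-descent sketch in Python of the revised parser rules (doc, item, func, word). It is an illustration only, not the parser ANTLR generates. `parse_doc` and `TOKENS_ID_FIRST` are hypothetical names, and the token list is copied by hand from the second token dump above.

```python
# Tokens as (type, text) pairs, copied from the dump produced with ID before WORD.
TOKENS_ID_FIRST = [
    ("ID", "Hello"), ("'<<'", "<<"), ("ID", "name"), ("'>>'", ">>"),
    ("PUNCT", ","), ("ID", "how"), ("ID", "are"), ("ID", "you"),
    ("ID", "at"), ("ID", "nine"), ("WORD", "o'clock"), ("PUNCT", "?"),
    ("EOF", "<EOF>"),
]


def parse_doc(tokens):
    """doc: item* EOF ; returns a list of ("func", name) or ("word", text)."""
    pos = 0
    items = []

    def expect(kind):
        # Consume one token of the given type or fail like a mismatched input.
        nonlocal pos
        actual, text = tokens[pos]
        if actual != kind:
            raise SyntaxError(f"mismatched input {text!r} expecting {kind}")
        pos += 1
        return text

    while tokens[pos][0] != "EOF":
        kind = tokens[pos][0]
        if kind == "'<<'":
            # func: '<<' ID '>>' ;
            expect("'<<'")
            name = expect("ID")
            expect("'>>'")
            items.append(("func", name))
        elif kind in ("WORD", "ID"):
            # word: WORD | ID ;  (either token type counts as a text word)
            items.append(("word", expect(kind)))
        else:
            raise SyntaxError(f"extraneous input {tokens[pos][1]!r}")
        if tokens[pos][0] == "PUNCT":
            pos += 1  # item: (func | word) PUNCT? ;
    return items


print(parse_doc(TOKENS_ID_FIRST))
```

The design point is that the lexer decides token types without knowing the parser context, so the parser has to accept both ID and WORD wherever a text word may appear, while `func` stays strict and accepts only ID.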