Ordering lexer rules in a grammar using ANTLR4


Question

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.

I want the parser to be able to handle something like this:

Hello << name >>, how are you?

At runtime I will replace "<< name >>" with the user's name.

So mostly I am parsing text words (and punctuation, symbols, etc.), with the occasional "<< something >>" tag, which I am calling a "func" in my grammar.

Here is my grammar:

grammar Question;

doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;

WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

Side note: I added "PUNCT?" at the end of the "item" rule because a comma can appear right after a "func", as in the example sentence above. Since a comma can also follow a "WORD", I put the punctuation in "item" instead of in both "func" and "WORD".

If I run this parser on the above sentence, I get a parse tree that looks like this:

Anything highlighted in red is a parse error.

So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.

If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

and run the parser, I get a parse tree like this:

So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.

How do I get past this conundrum?

I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).

Thanks for any help!

Answer

From The Definitive ANTLR 4 Reference:

ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.
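This also explains why 'name' is not seen as an ID even though the only rule using it is '<<' ID '>>'. The lexer tokenizes the input before the parser runs and does not know which parser rule is active, so 'name' gets the same token type inside and outside the angle brackets. The lexer also prefers the longest match; rule order only decides between rules that match the same number of characters. The Python sketch below imitates that behaviour with regular expressions. It is not how ANTLR is implemented: the tokenize helper, the rule tables, and listing the '<<' and '>>' literals first are assumptions made for illustration.

```python
import re

def tokenize(rules, text):
    """Imitate the lexer's choice: longest match wins, first rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mirrors '-> skip' in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex equivalents of the lexer rules in the question (fragments inlined).
# LETTER | DIGIT | SYMB together cover every character except . , ? ! space | { } < >
LITERALS = [("'<<'", r"<<"), ("'>>'", r">>")]
WS = ("WS", r"[ \t\n\r]")
WORD = ("WORD", r"[^.,?! |{}<>]+")
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
PUNCT = ("PUNCT", r"[.,?!]")

word_first = LITERALS + [WS, WORD, ID, PUNCT]  # order used in the question
id_first = LITERALS + [WS, ID, WORD, PUNCT]    # ID moved before WORD

text = "Hello << name >>, how are you at nine o'clock?"
for name, rules in (("WORD first", word_first), ("ID first", id_first)):
    print(name, tokenize(rules, text))
```

With WORD first, every plain word (including 'name') comes out as WORD, because WORD and ID match the same characters and WORD is listed earlier. With ID first, those words come out as ID, while o'clock stays a WORD because WORD matches more characters there than ID does.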

With your grammar (in Question.g4) and a t.text file containing

Hello << name >>, how are you at nine o'clock?

the execution of

$ grun Question doc -tokens -diagnostics t.text

gives

[@0,0:4='Hello',<WORD>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<WORD>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<WORD>,1:18]
[@6,22:24='are',<WORD>,1:22]
[@7,26:28='you',<WORD>,1:26]
[@8,30:31='at',<WORD>,1:30]
[@9,33:36='nine',<WORD>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}

Now change WORD to word in the item rule, and add a word rule:

item: (func | word) PUNCT? ;
word: WORD | ID ;

and put ID before WORD:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

The tokens are now

[@0,0:4='Hello',<ID>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<ID>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<ID>,1:18]
[@6,22:24='are',<ID>,1:22]
[@7,26:28='you',<ID>,1:26]
[@8,30:31='at',<ID>,1:30]
[@9,33:36='nine',<ID>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]

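and there are no more errors. As the -gui view shows, the parse tree now has branches identified as word or func. Plain words that happen to look like identifiers arrive as ID tokens, and the word rule accepts them alongside WORD.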
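To see why the word rule is the piece that makes this work, here is a small Python recognizer for the parser rules, run over the token types from the listings above. It is only a sketch: the parses function and its word_types parameter are hypothetical names, and it simply returns True or False instead of reporting and recovering from errors the way ANTLR does.

```python
def parses(token_types, word_types):
    """Recognize  doc: item* EOF ;  item: (func | word) PUNCT? ;  func: '<<' ID '>>' ;

    word_types is the set of token types accepted where a word is expected:
    {"WORD"} for the original item rule, {"WORD", "ID"} with  word: WORD | ID ;
    """
    pos = 0

    def peek():
        return token_types[pos] if pos < len(token_types) else "EOF"

    while peek() != "EOF":
        if peek() == "'<<'":
            # func: '<<' ID '>>'
            if token_types[pos:pos + 3] != ["'<<'", "ID", "'>>'"]:
                return False
            pos += 3
        elif peek() in word_types:
            pos += 1
        else:
            return False
        if peek() == "PUNCT":  # optional trailing punctuation
            pos += 1
    return True

# Token types copied from the two grun listings (EOF omitted).
word_first = ["WORD", "'<<'", "WORD", "'>>'", "PUNCT"] + ["WORD"] * 6 + ["PUNCT"]
id_first = ["ID", "'<<'", "ID", "'>>'", "PUNCT"] + ["ID"] * 5 + ["WORD", "PUNCT"]

print(parses(word_first, {"WORD"}))        # original grammar
print(parses(id_first, {"WORD"}))          # ID moved first, item still uses WORD
print(parses(id_first, {"WORD", "ID"}))    # ID first plus  word: WORD | ID ;
```

Only the last combination is accepted. Swapping the rule order alone fixes func but breaks plain words, which is the situation described in the question.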
