Ordering lexer rules in a grammar using ANTLR4

Question

I'm using ANTLR4 to generate a parser, and I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial, but I am still stuck on how to properly order (and/or write) my lexer and parser rules.

I want the parser to be able to handle something like this:

Hello << name >>, how are you?

At runtime I will replace "<< name >>" with the user's name.

So mostly I am parsing text words (and punctuation, symbols, etc.), except for the occasional "<< something >>" tag, which I call a "func" in my grammar rules.

Here is my grammar:

doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;

WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

Side note: I added "PUNCT?" at the end of the "item" rule because a comma can appear right after a "func", as in the example sentence above. Since a comma can also follow a "WORD", I decided to put the punctuation in "item" rather than in both "func" and "WORD".

If I run this parser on the above sentence, I get a parse tree that looks like this:

Anything highlighted in red is a parse error.

So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.

If I swap the order of "ID" and "WORD" in my grammar, so that they appear in this order:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

and run the parser, I get a parse tree like this:

So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.

How do I get past this conundrum?

I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).

Thanks for any help!

Recommended answer

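The lexer tokenizes the input on its own, without knowing which token the parser expects next, so the fact that the func rule only mentions ID has no influence on how "name" is tokenized. When several lexer rules match the same text, rule order decides. To quote:

ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.

Note that the lexer first prefers the rule that matches the longest run of characters; rule order only breaks ties between matches of equal length.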
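To make the tie-breaking concrete, here is a minimal Python sketch that imitates the lexer's behaviour: longest match wins, and on a tie the rule listed first wins. It is not ANTLR itself. The regular expressions are hand-written approximations of the question's lexer rules, and the names RULES, ORIGINAL, SWAPPED and tokenize are made up for this illustration.

```python
import re

# Regex approximations of the lexer rules in the question (an assumption of
# this sketch, not generated by ANTLR).
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
SYMB = r"[^a-zA-Z0-9.,?! |{}<>]"
RULES = {
    "'<<'": re.compile(r"<<"),
    "'>>'": re.compile(r">>"),
    "WS": re.compile(r"[ \t\n\r]"),
    "WORD": re.compile(rf"(?:{LETTER}|{DIGIT}|{SYMB})+"),
    "ID": re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    "PUNCT": re.compile(r"[.,?!]"),
}

# Rule order as in the question, and with ID moved in front of WORD.
ORIGINAL = ["'<<'", "'>>'", "WS", "WORD", "ID", "PUNCT"]
SWAPPED = ["'<<'", "'>>'", "WS", "ID", "WORD", "PUNCT"]


def tokenize(text, order):
    """Longest match wins; on a tie, the rule listed first in `order` wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name in order:
            m = RULES[name].match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mirrors `-> skip`
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


TEXT = "Hello << name >>, how are you at nine o'clock?"
for label, order in (("WORD first", ORIGINAL), ("ID first", SWAPPED)):
    print(label)
    for token in tokenize(TEXT, order):
        print("  ", token)
```

With WORD listed first, every plain word ties with ID and comes out as WORD, including "name". With ID listed first, those ties go to ID, while "o'clock" stays a WORD because ID can only match its leading "o" and the longer match wins.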

With your grammar (in Question.g4) and a t.text file containing

Hello << name >>, how are you at nine o'clock?

running

$ grun Question doc -tokens -diagnostics t.text

gives

[@0,0:4='Hello',<WORD>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<WORD>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<WORD>,1:18]
[@6,22:24='are',<WORD>,1:22]
[@7,26:28='you',<WORD>,1:26]
[@8,30:31='at',<WORD>,1:30]
[@9,33:36='nine',<WORD>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}

Now change WORD to word in the item rule, and add a word rule:

item: (func | word) PUNCT? ;
word: WORD | ID ;

and put ID before WORD:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

Now the tokens are

[@0,0:4='Hello',<ID>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<ID>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<ID>,1:18]
[@6,22:24='are',<ID>,1:22]
[@7,26:28='you',<ID>,1:26]
[@8,30:31='at',<ID>,1:30]
[@9,33:36='nine',<ID>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]

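and there are no more errors. Plain words such as "Hello" are now tokenized as ID, which the word rule accepts, while "o'clock" is still a WORD because ID can only match its first letter and the longer match wins. As the -gui graphic shows, the branches are now identified as word or func.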
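The effect of the revised parser rules can be sketched by hand in Python. This is only an illustration of the doc, item, func and word rules applied to the token types and text listed above. It is not code generated by ANTLR, and the helper names (parse_doc, parse_func_or_word, render) and the sample value "Alice" are hypothetical.

```python
# Token types and text taken from the grun output for the revised grammar.
TOKENS = [
    ("ID", "Hello"), ("'<<'", "<<"), ("ID", "name"), ("'>>'", ">>"),
    ("PUNCT", ","), ("ID", "how"), ("ID", "are"), ("ID", "you"),
    ("ID", "at"), ("ID", "nine"), ("WORD", "o'clock"), ("PUNCT", "?"),
]


def parse_func_or_word(tokens, pos):
    """Match one `func` or `word`; return (kind, text, next position)."""
    ttype, text = tokens[pos]
    if ttype == "'<<'":
        # func: '<<' ID '>>' ;
        if (pos + 2 >= len(tokens)
                or tokens[pos + 1][0] != "ID"
                or tokens[pos + 2][0] != "'>>'"):
            raise SyntaxError(f"expected ID and '>>' after '<<' at token {pos}")
        return "func", tokens[pos + 1][1], pos + 3
    if ttype in ("WORD", "ID"):
        # word: WORD | ID ;
        return "word", text, pos + 1
    raise SyntaxError(f"unexpected {ttype} at token {pos}")


def parse_doc(tokens):
    """doc: item* EOF ;  item: (func | word) PUNCT? ;"""
    items, pos = [], 0
    while pos < len(tokens):
        kind, text, pos = parse_func_or_word(tokens, pos)
        punct = ""
        if pos < len(tokens) and tokens[pos][0] == "PUNCT":
            punct = tokens[pos][1]
            pos += 1
        items.append((kind, text, punct))
    return items


def render(items, values):
    """Replace each func by its value; rejoin with single spaces (an
    assumption, since the lexer skips the original whitespace)."""
    return " ".join(
        (values[text] if kind == "func" else text) + punct
        for kind, text, punct in items
    )


items = parse_doc(TOKENS)
print(items)
print(render(items, {"name": "Alice"}))
```

Because word accepts both WORD and ID, every plain word parses regardless of which token type the lexer chose, while func still insists on an ID between the angle brackets. Feeding this sketch the original token stream, where "name" is a WORD, raises a SyntaxError at the func, which corresponds to the "mismatched input 'name' expecting ID" message.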
