Ordering lexer rules in a grammar using ANTLR4
Problem description
I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc.), except for the occasional "<< something >>" tag, which I am calling a "func" in my grammar.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because a comma can appear right after a "func", as in the example sentence above. Since a comma can also follow a "WORD", I decided to put the punctuation in "item" rather than in both "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
Suppose I swap the order of "ID" and "WORD" in my grammar, so that they are now in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
Running the parser with that order, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
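Both outcomes are consistent with how an ANTLR lexer picks tokens: it runs before the parser and without knowledge of parser rules, takes the longest match at each position, and on a tie prefers the rule defined first. The following is only a rough Python simulation of that behaviour, not ANTLR itself. The regexes are hand-written stand-ins for the lexer rules above, and the names `LT`, `GT` and `tokenize` are made up for this illustration.

```python
import re

# Regex stand-ins for the lexer rules in the question (assumed equivalents).
# LT and GT are hypothetical names for the implicit '<<' and '>>' tokens.
LT = ("LT", r"<<")
GT = ("GT", r">>")
WS = ("WS", r"[ \t\n\r]")
WORD = ("WORD", r"[^.,?! |{}<>]+")  # CHAR+ with CHAR = LETTER | DIGIT | SYMB
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
PUNCT = ("PUNCT", r"[.,?!]")


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


sentence = "Hello << name >>, how are you?"
word_first = tokenize(sentence, [LT, GT, WS, WORD, ID, PUNCT])
id_first = tokenize(sentence, [LT, GT, WS, ID, WORD, PUNCT])

print([name for name, _ in word_first])
print([name for name, _ in id_first])
```

With WORD listed first, "name" comes out as a WORD token, so a parser rule expecting ID between the brackets cannot match. With ID listed first, every purely alphanumeric word becomes an ID, so the parser never sees a WORD where "item" expects one.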
How do I get past this problem?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
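If that option were taken, the word/identifier distinction could in principle still be enforced after parsing, by checking the text of the token found inside the tag. The sketch below assumes the "func" rule has been changed to accept a WORD and that the check runs in application code, for example while walking the parse tree. `IDENTIFIER` and `check_tag_name` are hypothetical names, not part of ANTLR.

```python
import re

# Same shape as the ID rule in the grammar: a letter, then letters or digits.
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def check_tag_name(word_text):
    """Return the tag name if it looks like an identifier, else raise."""
    if IDENTIFIER.fullmatch(word_text) is None:
        raise ValueError(f"invalid variable name in tag: {word_text!r}")
    return word_text


print(check_tag_name("name"))
try:
    check_tag_name("na$me")
except ValueError as error:
    print(error)
```

The trade-off is that an invalid name such as "na$me" is then reported by this check rather than as a syntax error from the parser.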
Thanks for any help!