Antlrworks - extraneous input
I am new to this, so I will need your help. I am trying to parse the Wikipedia dump, and my first step is to map each rule they define into ANTLR. Unfortunately, I hit my first barrier:
line 1:8 extraneous input ''''' expecting '\'\''
I don't understand what is going on. Please help.
My code:
grammar Test;
options {
language = Java;
}
parse
: term+ EOF
;
term
: IDENT
| '[[' term ']]'
| '\'\'' term '\'\''
| '\'\'\'' term '\'\'\''
;
IDENT
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*
;
Input '''''Hello World'''''
A lexer rule must always match at least 1 character. Your rule:
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*;
can match the empty string (and there are infinitely many empty strings at any position in the input). Change the * to a +:
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
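As a rough illustration of why the * version is a problem, the sketch below uses Python's re module as a stand-in for the lexer (an assumption made for this example; it is not how ANTLR is implemented). The names STAR and PLUS are hypothetical. At offset 0 the input starts with a quote, which IDENT does not allow, yet the * pattern still reports a successful zero-length match there, while the + pattern reports no match:

```python
import re

# Same character set as IDENT; only the repetition operator differs.
STAR = re.compile(r'[a-zA-Z0-9=#" ]*')  # zero or more characters
PLUS = re.compile(r'[a-zA-Z0-9=#" ]+')  # one or more characters

text = "'''''Hello World'''''"

# Try both patterns at offset 0, where the input has a quote character.
star_match = STAR.match(text, 0)
plus_match = PLUS.match(text, 0)

print(repr(star_match.group()))  # '' (a "token" that consumes nothing)
print(plus_match)                # None (no IDENT can start here)
```

A token that consumes no input can be produced at every position without the lexer ever advancing, which is why a lexer rule should always consume at least one character.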
EDIT
Input '''''Hello World'''''
Although you put literal tokens inside parser rules ('\'\'\'', '\'\'', etc.), you must understand that they are not created at the behest of the parser. The lexer follows strict rules to create tokens:
- it tries to match as much as possible
- if 2 different lexer rules match the same number of characters, the one defined first will get precedence
Let's give your literal tokens names:
BRACKET_OPEN : '[[';
BRACKET_CLOSE : ']]';
Q3 : '\'\'\'';
Q2 : '\'\'';
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
Now, because of rule #1 (match as much as possible), the input '''''Hello World''''' will be tokenized as follows:
- Q3
- Q2
- IDENT
- Q3 (yes, a Q3!)
- Q2
The closing run of five quotes is split greedily in the same way as the opening run: three quotes first, then two. But your parser rule term will only accept Q3 Q2 IDENT Q2 Q3, so it is correct that your input fails to parse properly.
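The tokenization described above can be reproduced with a minimal Python sketch of a longest-match tokenizer. This is not ANTLR's implementation: the RULES table and the tokenize helper are hypothetical names made up for this illustration, and Python's re module stands in for the lexer's matching. It only models the two rules listed earlier (longest match wins, ties go to the rule defined first).

```python
import re

# Hypothetical rule table mirroring the named tokens, in definition order.
RULES = [
    ("BRACKET_OPEN", re.compile(r"\[\[")),
    ("BRACKET_CLOSE", re.compile(r"\]\]")),
    ("Q3", re.compile(r"'''")),
    ("Q2", re.compile(r"''")),
    ("IDENT", re.compile(r'[a-zA-Z0-9=#" ]+')),
]

def tokenize(text):
    """Return the token names produced by longest-match tokenization."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules tie in length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens

tokens = tokenize("'''''Hello World'''''")
print(tokens)  # ['Q3', 'Q2', 'IDENT', 'Q3', 'Q2']

# The nesting in the term rule needs the closing quotes in mirrored order.
expected_by_term = ["Q3", "Q2", "IDENT", "Q2", "Q3"]
print(tokens == expected_by_term)  # False
```

The tokenizer never consults the parser, so it has no way to know that the closing quotes should be split as two followed by three.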
Also, I recommend you not use the interpreter: it's rather buggy. The debugger works like a charm though!