Antlrworks - extraneous input


Problem description

I am new to this, so I need your help. I am trying to parse the Wikipedia dump, and my first step is to map each rule they define into ANTLR. Unfortunately, I have hit my first barrier:

line 1:8 extraneous input ''''' expecting '\'\''

I don't understand what is going on; please help.

My code:

grammar Test;

options {
    language = Java;
}

parse
    :  term+ EOF
    ;

term 
    :  IDENT
    |  '[[' term ']]'
    |  '\'\'' term '\'\''
    |  '\'\'\'' term '\'\'\''
    ;    

IDENT
    :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*
    ;

Input: '''''Hello World'''''

Solution

A lexer rule must always match at least 1 character. Your rule:

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*;

can match the empty string, and there are infinitely many empty strings at any position in the input. Change the * to a +:

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
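To see why the * version is a problem, here is a minimal sketch that mirrors IDENT with Python's `re` module. This is only an illustration: Python regular expressions are not ANTLR's lexer, and the names `star`, `plus`, and `text` are made up for this example.

```python
import re

# Character class mirroring IDENT from the grammar above.
IDENT_CHARS = r'[a-zA-Z0-9=#" ]'

star = re.compile(IDENT_CHARS + "*")  # zero or more: may match nothing
plus = re.compile(IDENT_CHARS + "+")  # one or more: needs a real character

text = "'''Hello"

# At position 0 the next character is a quote, which IDENT cannot consume.
empty = star.match(text, 0)
print(repr(empty.group()))  # a zero-length match, i.e. an "empty token"
print(plus.match(text, 0))  # None: the + version refuses to match here
```

With *, the pattern succeeds without consuming any input, so a lexer built on it could emit empty tokens at the same position indefinitely. With +, the rule only matches where there is at least one IDENT character.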

EDIT

Input '''''Hello World'''''

Although you put literal tokens inside parser rules ('\'\'\'', '\'\'', etc.), you must understand that they are not created at the behest of the parser. The lexer follows strict rules to create tokens:

  1. it tries to match as much as possible
  2. if 2 different lexer rules match the same amount of characters, the one defined first will get precedence

Let's give your literal tokens a name:

BRACKET_OPEN  : '[[';
BRACKET_CLOSE : ']]';
Q3            : '\'\'\'';
Q2            : '\'\'';
IDENT         :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;

Now, because of rule #1 (match as much as possible), the input '''''Hello World''''' will be tokenized as follows:

  • Q3
  • Q2
  • IDENT
  • Q3 (yes, a Q3!)
  • Q2
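The two lexer rules and the resulting token list can be reproduced with a small simulation. This is a sketch in Python, not ANTLR's actual implementation: the `RULES` table and the `tokenize` function are hypothetical helpers that only imitate "longest match, first rule wins ties" for the five token definitions above.

```python
import re

# Lexer rules in the same order as the named tokens above.
RULES = [
    ("BRACKET_OPEN", re.compile(r"\[\[")),
    ("BRACKET_CLOSE", re.compile(r"\]\]")),
    ("Q3", re.compile(r"'''")),
    ("Q2", re.compile(r"''")),
    ("IDENT", re.compile(r'[a-zA-Z0-9=#" ]+')),
]

def tokenize(text):
    """Return (name, lexeme) pairs: longest match wins, first rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

names = [name for name, _ in tokenize("'''''Hello World'''''")]
print(names)  # ['Q3', 'Q2', 'IDENT', 'Q3', 'Q2']
```

The lexer never looks at what the parser expects: at the closing run of five quotes it again sees three quotes first, so it emits Q3 before Q2, exactly as it did at the start.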

But for a token stream that starts with Q3 Q2, your parser rule term requires the closing tokens in the reverse order, i.e. Q3 Q2 IDENT Q2 Q3. The parser is therefore right to reject your input.
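The mismatch can be checked with a hand-written recursive-descent recognizer for the parse and term rules. This is a simplified sketch working on token names only; it is not the parser ANTLR generates, and `CLOSER`, `parse_term`, and `accepts` are names invented for this illustration.

```python
# Each opening token and the token that must close the nested term.
CLOSER = {"BRACKET_OPEN": "BRACKET_CLOSE", "Q2": "Q2", "Q3": "Q3"}

def parse_term(tokens, pos):
    """Parse one term starting at pos; return the position just after it."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "IDENT":
        return pos + 1
    if tok in CLOSER:
        # Opening token, exactly one nested term, then the matching closer.
        pos = parse_term(tokens, pos + 1)
        expected = CLOSER[tok]
        found = tokens[pos] if pos < len(tokens) else "EOF"
        if found != expected:
            raise SyntaxError(f"expecting {expected}, found {found}")
        return pos + 1
    raise SyntaxError(f"unexpected token {tok}")

def accepts(tokens):
    """True if the token names match  parse : term+ EOF."""
    try:
        pos = parse_term(tokens, 0)
        while pos < len(tokens):
            pos = parse_term(tokens, pos)
        return True
    except SyntaxError:
        return False

print(accepts(["Q3", "Q2", "IDENT", "Q2", "Q3"]))  # True
print(accepts(["Q3", "Q2", "IDENT", "Q3", "Q2"]))  # False
```

The second call uses the token order the lexer actually produces for '''''Hello World''''': after IDENT the recognizer expects Q2 but finds Q3, which corresponds to the mismatch reported in the error message.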

Also, I recommend you not use the interpreter: it's rather buggy. The debugger works like a charm though!
