Handling line feed in ANTLR4 grammar with Python target


Problem description

I am working on an ANTLR4 grammar for parsing Python DSL scripts (basically a subset of Python), with the target set to Python 3. I am having difficulty handling line feeds.

In my grammar, I use the lexer::members and NEWLINE embedded code from Bart Kiers's Python3 grammar for ANTLR4, ported to Python so that it can be used with the Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same as the one used in the Python 3 spec), since my DSL needs to target only certain elements of Python. Based on extensive testing, I do not think the Python part of the grammar is itself the source of the problem, so I won't post it here in full for now.

The input for the grammar is a file, matched by the file_input rule:

file_input: (NEWLINE | statement)* EOF;

The grammar performs rather well on my DSL and produces correct ASTs. The only problem I have is that my lexer rule NEWLINE clutters the AST with \r\n nodes and proves troublesome when trying to extend the generated MyGrammarListener with my own ExtendedListener which inherits from it.

Here is my NEWLINE lexer rule:

NEWLINE
 : ( {self.at_start_of_input()}? SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
    import re
    from MyParser import MyParser
    new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input)) 
    spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input)) 
    next = self._input.LA(1)

    if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
        self.skip()
    else:
        self.emit_token(self.common_token(self.NEWLINE, new_line))

        indent = self.get_indentation_count(spaces)
        if len(self.indents) == 0:
            previous = 0
        else:
            previous = self.indents[-1]

        if indent == previous:
            self.skip()
        elif indent > previous:
            self.indents.append(indent)
            self.emit_token(self.common_token(MyParser.INDENT, spaces))
        else:
            while len(self.indents) > 0 and self.indents[-1] > indent:
                self.emit_token(self.create_dedent())
                del self.indents[-1]
   };
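For readers who want to follow the bookkeeping in the action above without a generated lexer, here is a minimal stand-alone sketch of the same decision logic in plain Python. The function names (split_newline_text, indentation_count, tokens_for_newline) and the tab width of 8 are assumptions made for illustration; they are not part of the question's grammar or of the ANTLR runtime.

```python
import re


def split_newline_text(text):
    """Split the text matched by NEWLINE into (line-break part, trailing spaces)."""
    new_line = re.sub(r"[^\r\n\f]+", "", text)
    spaces = re.sub(r"[\r\n\f]+", "", text)
    return new_line, spaces


def indentation_count(spaces, tab_size=8):
    """Count indentation columns; a tab jumps to the next multiple of tab_size.

    The tab handling is an assumed convention, not taken from the question.
    """
    count = 0
    for ch in spaces:
        if ch == "\t":
            count += tab_size - (count % tab_size)
        else:
            count += 1
    return count


def tokens_for_newline(text, next_char, indents, opened=0):
    """Return the token names emitted for one NEWLINE match.

    `indents` is the stack of open indentation levels; it is updated in place.
    `next_char` is the lookahead character as a one-character string.
    """
    _, spaces = split_newline_text(text)
    # Blank lines, comment lines and breaks inside brackets emit nothing.
    if opened > 0 or next_char in ("\r", "\n", "\f", "#"):
        return []
    emitted = ["NEWLINE"]
    indent = indentation_count(spaces)
    previous = indents[-1] if indents else 0
    if indent > previous:
        indents.append(indent)
        emitted.append("INDENT")
    else:
        # Close every indentation level deeper than the current one.
        while indents and indents[-1] > indent:
            emitted.append("DEDENT")
            indents.pop()
    return emitted


stack = []
print(tokens_for_newline("\r\n", "\r", stack))      # blank line
print(tokens_for_newline("\r\n    ", "x", stack))   # deeper indentation
print(tokens_for_newline("\r\n", "y", stack))       # back to column 0
```

The sketch takes the lookahead as a one-character string. If the runtime's LA(1) returns an integer code point instead, which appears to be the case for the Python 3 runtime but is worth checking against the version in use, a comparison such as next == '\r' would never be true and the value would first need converting, for example with chr().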

The SPACES lexer rule fragment that NEWLINE uses is here:

 fragment SPACES
 : [ \t]+
 ;

I feel I should also add that both SPACES and COMMENTS are ultimately being skipped by the grammar, but only after the NEWLINE lexer rule is declared, which, as far as I know, should mean that there are no adverse effects from that, but I wanted to include it just in case.

SKIP_
 : ( SPACES | COMMENT ) -> skip
 ;

When the input file is run without any empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignment), I get the following errors:

line 15:4 extraneous input '\r\n    ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}

As I said before, when line feeds are omitted in my input file, the grammar and my ExtendedListener perform as they should, so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule - even the error statement I get says that it does not match alternative NEWLINE.

The AST produced by my grammar looks like this:

I would really appreciate any help with this, since I cannot see why my NEWLINE lexer rule would fail to match \r\n as it should, and I would like to allow empty lines in my DSL.

Solution

"so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule"

There is another explanation. An LL(1) parser would stop at the first mismatch, but ANTLR4 is a very smart LL(*) parser: it tries to match the input past the mismatch.

As I don't have your statement rule or your input around line 15, I'll demonstrate a possible case with the following grammar:

grammar Question;

/* Extraneous input parsing NL and spaces. */

@lexer::members {
  public boolean at_start_of_input() {return true;}; // even if it always returns true, it's not the cause of the problem
}

question
@init {System.out.println("Question last update 2108");}
    :   ( NEWLINE
    |     statement
              {System.out.println("found <<" + $statement.text + ">>");}
        )* EOF
    ;

statement
    :   'line ' NUMBER NEWLINE 'something else' NEWLINE
    ;

NUMBER : [0-9]+ ;
NEWLINE
    : ( {at_start_of_input()}? SPACES
       | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
      )
   ;

SKIP_
    :   SPACES -> skip
    ;

fragment SPACES
    :   [ \t]+
    ;

Input file t.text:

line 1
   something else

Execution:

$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ hexdump -C t.text 
00000000  6c 69 6e 65 20 31 0a 20  20 20 73 6f 6d 65 74 68  |line 1.   someth|
00000010  69 6e 67 20 65 6c 73 65  0a                       |ing else.|
00000019
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:4='line ',<'line '>,1:0]
[@1,5:5='1',<NUMBER>,1:5]
[@2,6:9='\n   ',<NEWLINE>,1:6]
[@3,10:23='something else',<'something else'>,2:3]
[@4,24:24='\n',<NEWLINE>,2:17]
[@5,25:24='<EOF>',<EOF>,3:0]
Question last update 2108
found <<line 1
   something else
>>

Now change the statement rule like so:

statement
//  :   'line ' NUMBER NEWLINE 'something else' NEWLINE
    :   'line ' NUMBER         'something else' NEWLINE // now NL will be extraneous
    ;

and execute again:

$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:4='line ',<'line '>,1:0]
[@1,5:5='1',<NUMBER>,1:5]
[@2,6:9='\n   ',<NEWLINE>,1:6]
[@3,10:23='something else',<'something else'>,2:3]
[@4,24:24='\n',<NEWLINE>,2:17]
[@5,25:24='<EOF>',<EOF>,3:0]
Question last update 2114
line 1:6 extraneous input '\n   ' expecting 'something else'
found <<line 1
   something else
>>

Note that the NL character and spaces have been correctly matched by the NEWLINE lexer rule.
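To make that point concrete without the ANTLR toolchain, here is a rough stand-alone approximation of the Question grammar's lexer using Python's re module. It is only a sketch: TOKEN_SPEC and tokenize are hypothetical names, the predicate alternative of NEWLINE is left out, and the "longest match wins, earlier rule wins ties" loop is a simplification of what a generated lexer does.

```python
import re

# Token patterns mirroring the lexer rules of the Question grammar, in order.
TOKEN_SPEC = [
    ("'line '", re.compile(r"line ")),
    ("'something else'", re.compile(r"something else")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NEWLINE", re.compile(r"(?:\r?\n|\r|\f)[ \t]*")),
    ("SKIP_", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Return a list of (token type, text) pairs, dropping SKIP_ tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier rule stays.
            if m and m.end() > pos and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise ValueError(f"no token matches at offset {pos}")
        if best_name != "SKIP_":
            tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens


for token_type, token_text in tokenize("line 1\n   something else\n"):
    print(f"{token_text!r:20} <{token_type}>")
```

In this approximation the line break and the three spaces that follow it come out as a single NEWLINE token, the same grouping the -tokens listing shows at position 6:9.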

You can find the explanation in section 9.1 of The Definitive ANTLR 4 Reference:

$ grun Simple prog
➾ class T ; { int i; }
➾ EOF
❮ line 1:8 extraneous input ';' expecting '{'

The parser reports an error at the ; but gives a slightly more informative answer because it knows that the next token is what it was actually looking for. This feature is called single-token deletion because the parser can simply pretend the extraneous token isn’t there and keep going.

Similarly, the parser can do single-token insertion when it detects a missing token.

In other words, ANTLR4 is powerful enough to resynchronize the input with the grammar even when several tokens do not match. If you run with the -gui option

$ grun Question question -gui t.text

you can see that ANTLR4 has parsed the whole file, even though a NEWLINE is missing in the statement rule and the input does not exactly match the grammar.
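The recovery described above can be imitated in a few lines of plain Python. This is a toy model only, assuming a flat list of expected token types; parse_statement is a hypothetical helper, and ANTLR's real error strategy is considerably more elaborate.

```python
def parse_statement(tokens, expected):
    """Match `tokens` (pairs of type and text) against `expected` token types.

    Returns (matched, errors). On a mismatch, if the token after the offending
    one is the expected type, the offending token is reported as extraneous
    and skipped (single-token deletion).
    """
    errors = []
    pos = 0
    for want in expected:
        if pos < len(tokens) and tokens[pos][0] == want:
            pos += 1
            continue
        # Peek one token ahead before giving up.
        if pos + 1 < len(tokens) and tokens[pos + 1][0] == want:
            errors.append(f"extraneous input {tokens[pos][1]!r} expecting {want}")
            pos += 2
            continue
        found = tokens[pos][1] if pos < len(tokens) else "<EOF>"
        errors.append(f"mismatched input {found!r} expecting {want}")
        return False, errors
    return True, errors


# Token stream as listed by -tokens for t.text (EOF omitted).
tokens = [
    ("'line '", "line "),
    ("NUMBER", "1"),
    ("NEWLINE", "\n   "),
    ("'something else'", "something else"),
    ("NEWLINE", "\n"),
]
with_newline = ["'line '", "NUMBER", "NEWLINE", "'something else'", "NEWLINE"]
without_newline = ["'line '", "NUMBER", "'something else'", "NEWLINE"]

print(parse_statement(tokens, with_newline))
print(parse_statement(tokens, without_newline))
```

With the second rule the token stream is identical, yet the NEWLINE after the number is reported as extraneous and parsing carries on to the end. The lexer did its job; the complaint comes from the parser rule.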

To summarize: extraneous input is quite a common error when developing a grammar. It can come from a mismatch between the input being parsed and what the rule expects, or from some piece of input having been matched as a different token than the one we expect. The latter can be detected by examining the list of tokens produced by the -tokens option.
