ANTLR get and split lexer content

Problem description

First, sorry about my English; I am still learning.

I am writing a Python module for my framework that parses CSS files. I tried regex and ply (a Python lexer and parser), but I ended up with ANTLR.

As a first step, I need to parse comments from a CSS file. This is the CSS string I want to parse:

/*test*/

/*
test1
*/

/*test2/*nested*/comment*/

I know that CSS doesn't allow nested comments, but I need them in my framework. I wrote a simple ANTLR grammar:

grammar CSS;

options {
    language = Python;
}

styleSheet
    : comments EOF ;

comments
    : NESTED_ML_COMMENT*
    ;

NESTED_ML_COMMENT
    :   '/*' 
        (options {greedy=false;} : (NESTED_ML_COMMENT | . ) )* 
        '*/' 
    ;

LINEBREAK 
    :  ('\n\r' | '\n')+{$channel=HIDDEN; };

What I get as a result is:

What I expect (made in Paint :D):

Notice that I don't want /* and */ in the result.

Is there any way to do this in pure ANTLR? I have no problem with using Python in ANTLR, but if there is a way to do this without Python, I would be grateful.

Solution

No, there is no easy way. Since NESTED_ML_COMMENT is a lexer rule (a "simple" token), you cannot let a parser rule create any more structure in source like /*test2/*nested*/comment*/: a lexer rule always produces a "flat" sequence of characters. Sure, there are (easy) ways to rewrite this character sequence (i.e. remove /* and */), but creating parent-sibling hierarchies from it is not possible.
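To make the "flat" point concrete, here is a small plain-Python sketch (no ANTLR involved; `strip_delimiters` is a hypothetical helper name, not something from the runtime). It shows that removing the delimiters from a token's text is easy, but that the result no longer carries any nesting information:

```python
def strip_delimiters(token_text):
    """Remove every comment delimiter from the flat text of one token."""
    return token_text.replace('/*', '').replace('*/', '')


flat = strip_delimiters('/*test2/*nested*/comment*/')
print(flat)  # test2nestedcomment
```

After the rewrite there is no way to tell that "nested" was an inner comment, which is why a hierarchy has to be built by a parser rather than recovered from the token text.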

In order to create a hierarchy like you displayed in your 2nd image, you will have to "promote" your comment-rule to the parser (so make it into a parser rule). In that case, your lexer would have a COMMENT_START : '/*'; and COMMENT_END : '*/'; rule. But that opens a can of worms: inside your lexer you would now also need to account for all characters that can come between /* and */.

You could create another parser that parses (nested) comments and use that inside your CSS grammar. Inside your CSS grammar, you simply keep it as it is, and your second parser is a dedicated comments-parser that creates a hierarchy from the comment-tokens.

A quick demo. The grammar:

grammar T;

parse
  :  comment EOF 
  ;

comment
  :  COMMENT_START (ANY | comment)* COMMENT_END
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
ANY           :  . ;

will parse the source /*test2/*nested*/comment*/ into the following parse tree:

which you can rewrite so that /* and */ are removed, of course.

Inside your CSS grammar, you then do:

comment
  :  NESTED_ML_COMMENT 
     {
       text = $NESTED_ML_COMMENT.text
       # invoke the TParser (my demo grammar) on `text`
     }
  ;

EDIT

Note that ANTLRWorks creates its own internal parse tree to which you have no access. If you do not tell ANTLR to generate a proper AST, you will just end up with a flat list of tokens (even though ANTLRWorks suggests it is some sort of tree).

Here's a previous Q&A that explains how to create a proper AST: How to output the AST built using ANTLR?

Now let's get back to the "comment" grammar I posted above. I'll rename the ANY rule to TEXT. At the moment, this rule only matches a single character at a time. But it's more convenient to let it match all the way up to the next /* or */. This can be done by introducing a plain Python method in the lexer class that performs this check. Inside the TEXT rule, we'll use that method inside a predicate so that * gets matched if it's not directly followed by a /, and a / gets matched if it's not directly followed by a *:

grammar Comment;

options {
  output=AST;
  language=Python;
}

tokens {
  COMMENT;
}

@lexer::members {
  def not_part_of_comment(self):
    current = self.input.LA(1)
    next = self.input.LA(2)
    if current == ord('*'): return next != ord('/')
    if current == ord('/'): return next != ord('*')  
    return True
}

parse
  :  comment EOF -> comment
  ;

comment
  :  COMMENT_START atom* COMMENT_END -> ^(COMMENT atom*)
  ;

atom
  :  TEXT
  |  comment
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
TEXT          : ({self.not_part_of_comment()}?=> . )+ ;
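If it helps to see what the predicate buys you, the following is a rough plain-Python imitation of the three lexer rules. It is only a sketch of the intended behaviour, not the code ANTLR generates; `tokenize` and the `(type, value)` tuples are my own illustrative choices, and the predicate here takes the text and an index instead of reading from `self.input`:

```python
def not_part_of_comment(text, i):
    """Mirror of the lexer predicate: True if text[i] may be part of TEXT."""
    current = text[i]
    nxt = text[i + 1] if i + 1 < len(text) else ''
    if current == '*':
        return nxt != '/'
    if current == '/':
        return nxt != '*'
    return True


def tokenize(text):
    """Split text into (type, value) pairs, roughly like the Comment lexer."""
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == '/*':
            tokens.append(('COMMENT_START', pair))
            i += 2
        elif pair == '*/':
            tokens.append(('COMMENT_END', pair))
            i += 2
        else:
            start = i
            # TEXT: one or more characters that do not begin a delimiter
            while i < len(text) and not_part_of_comment(text, i):
                i += 1
            tokens.append(('TEXT', text[start:i]))
    return tokens


tokens = tokenize('/*a*b/*c*/d*/')
for token in tokens:
    print(token)
```

Note how the lone * inside "a*b" ends up in a TEXT token, because it is not directly followed by a /.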

Find out more about the predicate syntax, { boolean_expression }?=>, in this Q&A: What is a 'semantic predicate' in ANTLR?

To test all of this, make sure you have the proper Python runtime libraries installed (see the ANTLR Wiki). And be sure to use ANTLR version 3.1.3 with this runtime.

Generate the lexer and parser like this:

java -cp antlr-3.1.3.jar org.antlr.Tool Comment.g 

and test the lexer and parser with the following Python script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *

# http://www.antlr.org/wiki/display/ANTLR3/Python+runtime
# http://www.antlr.org/download/antlr-3.1.3.jar

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = '/*aaa1/*bbb/*ccc*/*/aaa2*/'
char_stream = antlr3.ANTLRStringStream(input)
lexer = CommentLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CommentParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)

As you can see, from the source "/*aaa1/*bbb/*ccc*/*/aaa2*/", the following AST is created:

COMMENT
   aaa1
   COMMENT
      bbb
      COMMENT
         ccc
   aaa2
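For comparison, the same tree shape can be produced without ANTLR at all. The sketch below is a hand-written recursive-descent parser in plain Python that mirrors the `comment` and `atom` rules; the names `TOKEN_RE`, `parse_comment` and `dump`, and the use of `('COMMENT', children)` tuples as nodes, are assumptions of this sketch and have nothing to do with the generated CommentParser:

```python
import re

# COMMENT_START, COMMENT_END, or a run of characters containing neither
TOKEN_RE = re.compile(r'/\*|\*/|(?:(?!/\*|\*/).)+', re.DOTALL)


def parse_comment(tokens, pos=0):
    """Parse one comment starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != '/*':
        raise ValueError('expected /* at token %d' % pos)
    pos += 1
    children = []
    while pos < len(tokens) and tokens[pos] != '*/':
        if tokens[pos] == '/*':
            # atom : comment  (recurse for the nested comment)
            child, pos = parse_comment(tokens, pos)
        else:
            # atom : TEXT
            child, pos = tokens[pos], pos + 1
        children.append(child)
    if pos >= len(tokens):
        raise ValueError('unterminated comment')
    return ('COMMENT', children), pos + 1


def dump(tree, indent=0):
    """Return the tree as indented lines, in the style of print_level_order."""
    if isinstance(tree, tuple):
        lines = ['   ' * indent + tree[0]]
        for child in tree[1]:
            lines.extend(dump(child, indent + 1))
        return lines
    return ['   ' * indent + tree]


source = '/*aaa1/*bbb/*ccc*/*/aaa2*/'
tree, _ = parse_comment(TOKEN_RE.findall(source))
print('\n'.join(dump(tree)))
```

Printing the result gives the same indented COMMENT / aaa1 / COMMENT / bbb / ... listing as the AST above, which can be a handy cross-check when debugging the grammar.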

EDIT II

I might as well show how you can invoke the Comment parser from your CSS grammar. Here's a quick demo:

grammar CSS;

options {
  output=AST;
  language=Python;
}

tokens {
  CSS_FILE;
  RULE;
  BLOCK;
  DECLARATION;
}

@parser::header {
import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *
}

@parser::members {
  def parse_comment(self, text):
    lexer = CommentLexer(antlr3.ANTLRStringStream(text))
    parser = CommentParser(antlr3.CommonTokenStream(lexer))
    return parser.parse().tree 
}

parse
  :  atom+ EOF -> ^(CSS_FILE atom+)
  ;

atom
  :  rule
  |  Comment -> {self.parse_comment($Comment.text)}
  ;

rule
  :  Identifier declarationBlock -> ^(RULE Identifier declarationBlock)
  ;

declarationBlock
  :  '{' declaration+ '}' -> ^(BLOCK declaration+)
  ;

declaration
  :  a=Identifier ':' b=Identifier ';' -> ^(DECLARATION $a $b)
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
  ;

Comment
  :  '/*' (options {greedy=false;} : Comment | . )* '*/'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
  ;
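The recursive Comment lexer rule in this grammar is what keeps a whole nested comment together as one token before it is handed to the Comment parser. As a rough illustration of that idea only (not of how ANTLR implements it), here is a plain-Python scanner that finds the end of a nested comment with a depth counter; `scan_nested_comment` is a hypothetical helper name:

```python
def scan_nested_comment(text, start=0):
    """Return the index just past the comment that opens at text[start]."""
    if text[start:start + 2] != '/*':
        raise ValueError('no comment starts at index %d' % start)
    depth = 0
    i = start
    while i < len(text):
        pair = text[i:i + 2]
        if pair == '/*':
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif pair == '*/':
            depth -= 1          # leaving the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError('unterminated comment')


css = 'h1 {a: b;}\n/*aaa1/*bbb/*ccc*/*/aaa2*/\np {x: y;}'
begin = css.index('/*')
end = scan_nested_comment(css, begin)
print(css[begin:end])  # /*aaa1/*bbb/*ccc*/*/aaa2*/
```

The slice that comes back is the text you would pass on to the dedicated comment parser.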

If you parse the source:

h1 {  a: b;  c: d;}

/*aaa1/*bbb/*ccc*/*/aaa2*/

p {x  :  y;}

with the CSSParser, you'll get the following tree:

CSS_FILE
   RULE
      h1
      BLOCK
         DECLARATION
            a
            b
         DECLARATION
            c
            d
   COMMENT
      aaa1
      COMMENT
         bbb
         COMMENT
            ccc
      aaa2
   RULE
      p
      BLOCK
         DECLARATION
            x
            y

as you can see by running this test script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CSSLexer import *
from CSSParser import *

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = 'h1 {  a: b;  c: d;}\n\n/*aaa1/*bbb/*ccc*/*/aaa2*/\n\np {x  :  y;}'
char_stream = antlr3.ANTLRStringStream(input)
lexer = CSSLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CSSParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)
