Antlr解析python安装文件 [英] Antlr to parse python setup file

查看:97
本文介绍了Antlr解析python安装文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Java程序,必须解析python setup.py文件以从中提取信息.我有点工作,但碰壁了.我首先从一个简单的原始文件开始,一旦我开始运行它,那么我将担心去除那些我不想使其反映实际文件的噪音.

I have a java program that has to parse a python setup.py file to extract info from it. I sorta have something working, but I hit a wall. I am starting with a simple raw file first, once i get that running, then i will worry about stripping out the noise that i don't want to make it reflect an actual file.

这是我的语法

grammar SetupPy ;

file_input: (NEWLINE | setupDeclaration)* EOF;

setupDeclaration : 'setup' '(' method ')';
method : setupRequires testRequires;
setupRequires : 'setup_requires' '=' '[' LISTVAL* ']' COMMA;
testRequires : 'tests_require' '=' '[' LISTVAL* ']' COMMA;

WS: [ \t\n\r]+ -> skip ;
COMMA : ',' -> skip ;
LISTVAL : SHORT_STRING ;

UNKNOWN_CHAR
 : .
 ;

fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;

/// stringescapeseq ::=  "\" <any source character>
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' NEWLINE
;

fragment SPACES
 : [ \t]+
 ;

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n\f]+", "");
     String spaces = getText().replaceAll("[\r\n\f]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents,
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

和我当前的测试文件(如我所说,下一步是从普通文件中去除噪音)

and my current test file (like i said, stripping the noise from a normal file is next step)

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 'unittest2'],
)

我遇到的问题是如何告诉antlr setup_requires和tests_requires包含数组.我想要这些数组的值,而不管是否有人使用单引号,双引号,每个值在不同的行以及以上所有元素的组合.我不知道如何实现这一目标.我可以帮忙吗?也许是一两个例子?

Where i am stuck is how to tell antlr that setup_requires and tests_requires contain arrays. I want the values of those arrays, no matter if someone used single quotes, double quotes, each value on a different line, and combinations of all the above. I don't have a clue how to pull that off. Can i get some help please? maybe an example or two?

注意事项

  1. 不,我不能使用jython并仅运行文件.
  2. 由于该文件的开发人员样式差异很大,因此无法选择正则表达式

当然,在此问题之后,我仍然需要弄清楚如何从普通文件中消除噪音.我尝试使用Python3语法来做到这一点,但是我是antlr的新手,这让我大吃一惊.我不知道如何编写规则以提取值,所以我决定尝试一种更简单的语法.并迅速撞到另一堵墙.

And of course after this issue, I still need to figure out how to strip the noise from a normal file. I tried using the Python3 grammar to do this, but me being a novice at antlr, it blew me away. i couldn't figure out how to write the rules to pull the values, so I decided to try a far simpler grammar. And quickly hit another wall.

修改 这是一个最终将要解析的实际setup.py文件.请记住,setup_requires和test_requires可能存在或可能不存在,并且可能存在或不存在该顺序.

edit here is an actual setup.py file that it will eventually have to parse. keeping in mind the setup_requires and test_requires may or may not be there and may or may not be in that order.

# -*- coding: utf-8 -*-
from __future__ import with_statement

from setuptools import setup


def get_version(fname='mccabe.py'):
    with open(fname) as f:
        for line in f:
            if line.startswith('__version__'):
                return eval(line.split('=')[-1])


def get_long_description():
    descr = []
    for fname in ('README.rst',):
        with open(fname) as f:
            descr.append(f.read())
    return '\n\n'.join(descr)


setup(
    name='mccabe',
    version=get_version(),
    description="McCabe checker, plugin for flake8",
    long_description=get_long_description(),
    keywords='flake8 mccabe',
    author='Tarek Ziade',
    author_email='tarek@ziade.org',
    maintainer='Ian Cordasco',
    maintainer_email='graffatcolmingov@gmail.com',
    url='https://github.com/pycqa/mccabe',
    license='Expat license',
    py_modules=['mccabe'],
    zip_safe=False,
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points={
        'flake8.extension': [
            'C90 = mccabe:McCabeChecker',
        ],
    },
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Software Development :: Quality Assurance',
    ],
)

尝试调试,简化并意识到我不需要找到方法,只需找到值即可.所以我在玩这种语法器

Trying to debug and simplify and realized i don't need to find the method, just the values. so I'm playing with this grammer

grammar SetupPy ;

file_input: (ignore setupRequires ignore | ignore testRequires ignore )* EOF;

setupRequires : 'setup_requires' '=' '[' dependencyValue* (',' dependencyValue)* ']';
testRequires : 'tests_require' '=' '[' dependencyValue* (',' dependencyValue)* ']';

dependencyValue: LISTVAL;

ignore : UNKNOWN_CHAR? ;

LISTVAL: SHORT_STRING;
UNKNOWN_CHAR: . -> channel(HIDDEN);

fragment SHORT_STRING: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"';

fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\'
;

非常适合简单的程序,甚至可以处理乱序问题.但是无法处理完整文件,却被挂在

Works great for the simple one, even handles the out of order issue. but doesnt' work on the full file, it gets hung up on the

def get_version(fname='mccabe.py'):

等于在该行签名

推荐答案

我已经检查了您的语法并将其简化了很多.我取出所有python-esqe空格处理,只是将空格视为空格.该语法还会解析此输入,正如您在问题中所说,该输入每行处理一个项目,单引号和双引号等等.

I've examined your grammar and simplified it quite a bit. I took out all the python-esqe whitespace handling and just treated whitespace as whitespace. This grammar also parses this input, which as you said in the question, handles one item per line, single and double quotes, etc...

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 
    'unittest2', 
    "test_3" ],
)

这是经过简化的语法:

grammar SetupPy ;
setupDeclaration : 'setup' '(' method ')' EOF;
method : setupRequires testRequires  ;
setupRequires : 'setup_requires' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
testRequires : 'tests_require' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
WS: [ \t\n\r]+ -> skip ;
LISTVAL : SHORT_STRING ;
fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' 
;

哦,这是解析器-词法分析器的输出,显示标记的正确分配:

Oh and here's the parser-lexer output showing the correct assignment of tokens:

[@0,0:4='setup',<'setup'>,1:0]
[@1,5:5='(',<'('>,1:5]
[@2,12:25='setup_requires',<'setup_requires'>,2:4]
[@3,26:26='=',<'='>,2:18]
[@4,27:27='[',<'['>,2:19]
[@5,28:42=''pytest-runner'',<LISTVAL>,2:20]
[@6,43:43=']',<']'>,2:35]
[@7,44:44=',',<','>,2:36]
[@8,51:63='tests_require',<'tests_require'>,3:4]
[@9,64:64='=',<'='>,3:17]
[@10,65:65='[',<'['>,3:18]
[@11,66:73=''pytest'',<LISTVAL>,3:19]
[@12,74:74=',',<','>,3:27]
[@13,79:89=''unittest2'',<LISTVAL>,4:1]
[@14,90:90=',',<','>,4:12]
[@15,95:102='"test_3"',<LISTVAL>,5:1]
[@16,104:104=']',<']'>,5:10]
[@17,105:105=',',<','>,5:11]
[@18,108:108=')',<')'>,6:0]
[@19,109:108='<EOF>',<EOF>,6:1]

现在,您应该能够遵循简单的ANTLR访客或侦听器模式来获取您的LISTVAL令牌并对其进行处理.我希望这能满足您的需求.当然,它可以很好地解析您的测试输入,甚至更多.

Now you should be able to follow a simple ANTLR Visitor or Listener pattern to grab up your LISTVAL tokens and do your thing with them. I hope this meets your needs. It certainly parses your test input well, and more.

这篇关于Antlr解析python安装文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆