ANTLR to parse a Python setup file


Question

I have a Java program that has to parse a Python setup.py file to extract information from it. I sort of have something working, but I've hit a wall. I'm starting with a simple, stripped-down file; once that works, I'll worry about filtering out the noise so the grammar can handle a real file.

Here is my grammar:

grammar SetupPy ;

file_input: (NEWLINE | setupDeclaration)* EOF;

setupDeclaration : 'setup' '(' method ')';
method : setupRequires testRequires;
setupRequires : 'setup_requires' '=' '[' LISTVAL* ']' COMMA;
testRequires : 'tests_require' '=' '[' LISTVAL* ']' COMMA;

WS: [ \t\n\r]+ -> skip ;
COMMA : ',' -> skip ;
LISTVAL : SHORT_STRING ;

UNKNOWN_CHAR
 : .
 ;

fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;

/// stringescapeseq ::=  "\" <any source character>
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' NEWLINE
;

fragment SPACES
 : [ \t]+
 ;

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n\f]+", "");
     String spaces = getText().replaceAll("[\r\n\f]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents,
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

And here is my current test file (as I said, stripping the noise from a normal file is the next step):

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 'unittest2'],
)

Where I am stuck is how to tell ANTLR that setup_requires and tests_require contain arrays. I want the values of those arrays regardless of whether someone used single quotes, double quotes, one value per line, or any combination of these. I don't have a clue how to pull that off. Can I get some help, please? Maybe an example or two?

Caveats:

  1. No, I can't use Jython and just run the file.
  2. Regex is not an option, because developers' styles for this file vary widely.

And of course, after this issue I still need to figure out how to strip the noise from a normal file. I tried using the Python3 grammar for this, but as an ANTLR novice I was overwhelmed by it. I couldn't figure out how to write rules to pull out the values, so I decided to try a far simpler grammar, and quickly hit another wall.

Edit: here is an actual setup.py file that it will eventually have to parse. Keep in mind that setup_requires and tests_require may or may not be present, and may or may not appear in that order.

# -*- coding: utf-8 -*-
from __future__ import with_statement

from setuptools import setup


def get_version(fname='mccabe.py'):
    with open(fname) as f:
        for line in f:
            if line.startswith('__version__'):
                return eval(line.split('=')[-1])


def get_long_description():
    descr = []
    for fname in ('README.rst',):
        with open(fname) as f:
            descr.append(f.read())
    return '\n\n'.join(descr)


setup(
    name='mccabe',
    version=get_version(),
    description="McCabe checker, plugin for flake8",
    long_description=get_long_description(),
    keywords='flake8 mccabe',
    author='Tarek Ziade',
    author_email='tarek@ziade.org',
    maintainer='Ian Cordasco',
    maintainer_email='graffatcolmingov@gmail.com',
    url='https://github.com/pycqa/mccabe',
    license='Expat license',
    py_modules=['mccabe'],
    zip_safe=False,
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points={
        'flake8.extension': [
            'C90 = mccabe:McCabeChecker',
        ],
    },
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Software Development :: Quality Assurance',
    ],
)

While trying to debug and simplify, I realized I don't need to find the method call, just the values. So I'm playing with this grammar:

grammar SetupPy ;

file_input: (ignore setupRequires ignore | ignore testRequires ignore )* EOF;

setupRequires : 'setup_requires' '=' '[' dependencyValue* (',' dependencyValue)* ']';
testRequires : 'tests_require' '=' '[' dependencyValue* (',' dependencyValue)* ']';

dependencyValue: LISTVAL;

ignore : UNKNOWN_CHAR? ;

LISTVAL: SHORT_STRING;
UNKNOWN_CHAR: . -> channel(HIDDEN);

fragment SHORT_STRING: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"';

fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\'
;

It works great for the simple file and even handles the out-of-order issue, but it doesn't work on the full file. It gets hung up on the equals sign in this line:

def get_version(fname='mccabe.py'):

Recommended answer

I've examined your grammar and simplified it quite a bit. I took out all the Python-esque whitespace handling and just treated whitespace as whitespace. This grammar parses the input below, which, as you asked in the question, includes one item per line, single and double quotes, and so on:

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 
    'unittest2', 
    "test_3" ],
)

And here's the much-simplified grammar:

grammar SetupPy ;
setupDeclaration : 'setup' '(' method ')' EOF;
method : setupRequires testRequires  ;
setupRequires : 'setup_requires' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
testRequires : 'tests_require' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
WS: [ \t\n\r]+ -> skip ;
LISTVAL : SHORT_STRING ;
fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' 
;

And here's the token output from the lexer, showing the correct assignment of tokens:

[@0,0:4='setup',<'setup'>,1:0]
[@1,5:5='(',<'('>,1:5]
[@2,12:25='setup_requires',<'setup_requires'>,2:4]
[@3,26:26='=',<'='>,2:18]
[@4,27:27='[',<'['>,2:19]
[@5,28:42=''pytest-runner'',<LISTVAL>,2:20]
[@6,43:43=']',<']'>,2:35]
[@7,44:44=',',<','>,2:36]
[@8,51:63='tests_require',<'tests_require'>,3:4]
[@9,64:64='=',<'='>,3:17]
[@10,65:65='[',<'['>,3:18]
[@11,66:73=''pytest'',<LISTVAL>,3:19]
[@12,74:74=',',<','>,3:27]
[@13,79:89=''unittest2'',<LISTVAL>,4:1]
[@14,90:90=',',<','>,4:12]
[@15,95:102='"test_3"',<LISTVAL>,5:1]
[@16,104:104=']',<']'>,5:10]
[@17,105:105=',',<','>,5:11]
[@18,108:108=')',<')'>,6:0]
[@19,109:108='<EOF>',<EOF>,6:1]
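The token dump above already carries everything needed: each LISTVAL sits between a '[' and a ']' that follow one of the two keywords. As a rough illustration of the grouping a listener would end up doing, here is a plain-Java sketch that walks the token texts in order. It is a stand-in that uses no ANTLR classes; the class name `RequirementCollector` and its methods are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a listener (assumption: not ANTLR API). It walks token texts in
// order and groups quoted strings under the keyword that precedes their list.
final class RequirementCollector {

    // Hypothetical helper: returns keyword -> quoted strings found in its list.
    static Map<String, List<String>> collect(List<String> tokenTexts) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;   // list for the most recent keyword, if any
        boolean inList = false;        // true between '[' and ']' of that keyword
        for (String text : tokenTexts) {
            if (text.equals("setup_requires") || text.equals("tests_require")) {
                current = result.computeIfAbsent(text, k -> new ArrayList<>());
            } else if (text.equals("[")) {
                inList = current != null;
            } else if (text.equals("]")) {
                inList = false;
                current = null;
            } else if (inList && isQuoted(text)) {
                current.add(text);     // text still includes its quotes
            }
        }
        return result;
    }

    // True for text wrapped in matching single or double quotes.
    static boolean isQuoted(String text) {
        if (text.length() < 2) {
            return false;
        }
        char first = text.charAt(0);
        return (first == '\'' || first == '"')
                && text.charAt(text.length() - 1) == first;
    }

    public static void main(String[] args) {
        // Token texts copied from the dump above, in order.
        List<String> tokens = Arrays.asList(
                "setup", "(",
                "setup_requires", "=", "[", "'pytest-runner'", "]", ",",
                "tests_require", "=", "[", "'pytest'", ",", "'unittest2'", ",",
                "\"test_3\"", "]", ",",
                ")");
        System.out.println(collect(tokens));
    }
}
```

In a real listener the same grouping falls out of the parse tree, since each rule context only contains the LISTVAL tokens of its own list.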

Now you should be able to follow a simple ANTLR Visitor or Listener pattern to grab your LISTVAL tokens and do whatever you need with them. I hope this meets your needs. It certainly parses your test input, including the multi-line and mixed-quote variations.
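One detail to handle after grabbing the tokens: as the token output shows, the text of a LISTVAL token still includes its surrounding quotes. A minimal sketch of stripping them in plain Java follows. The class `ListValText` and its `unquote` method are hypothetical names for this example, and the escape handling is an assumption that only covers escaped quotes and backslashes.

```java
// Hypothetical post-processing helper for LISTVAL token text (not ANTLR API).
final class ListValText {

    // Strips the matching outer quotes and resolves \' \" and \\ escapes.
    // Any other backslash sequence is copied through unchanged.
    static String unquote(String tokenText) {
        if (tokenText.length() < 2) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        char quote = tokenText.charAt(0);
        boolean validQuote = quote == '\'' || quote == '"';
        if (!validQuote || tokenText.charAt(tokenText.length() - 1) != quote) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        String body = tokenText.substring(1, tokenText.length() - 1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(i + 1);
                if (next == '\'' || next == '"' || next == '\\') {
                    out.append(next);  // keep the escaped character only
                    i++;               // skip the character just consumed
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unquote("'pytest-runner'"));
        System.out.println(unquote("\"test_3\""));
    }
}
```

Inside a listener or visitor, the argument would be the text of each LISTVAL terminal node.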
