Extract all variables from a string of Python code (regex or AST)

Problem description

I want to find and extract all the variables in a string that contains Python code. I only want to extract variables (and variables with subscripts), not function calls.

For example, from the following string:

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].

The first two ideas that come to mind are to use either a regex or an AST. I have tried both, without success.

With a regex, to keep things simple, I thought it would be a good idea to extract the "top level" variables first and then recursively extract the nested ones. Unfortunately, I can't even get the first step to work.

This is what I have so far:

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

Here is a demo: https://regex101.com/r/INPRdN/2
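
To see concretely why this pattern falls short, here is a small check (a sketch that reuses the `code` string from above; the variable names `greedy`, `lazy`, `greedy_matches` and `lazy_matches` are only illustrative). The greedy `.*` runs to the last `]` in the whole string, and a lazy `.*?` variant (an assumption, not something tried in the question) stops at the first `]`, which leaves nested subscripts unbalanced:

```python
import re

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

# Pattern from the question: `.*` is greedy, so it extends to the LAST `]`.
greedy = r'[_a-zA-Z]\w*\s*(\[.*\])?'
# Hypothetical lazy variant: `.*?` stops at the FIRST `]` instead.
lazy = r'[_a-zA-Z]\w*\s*(\[.*?\])?'

greedy_matches = [m.group(0) for m in re.finditer(greedy, code)]
lazy_matches = [m.group(0) for m in re.finditer(lazy, code)]

# Only two matches: 'foo ' and one long match from 'bar[1]' to 'var3[0]'.
print(greedy_matches)
# Contains 'baz[1:10:var1[2+1]', which is missing its closing bracket.
print(lazy_matches)
```

Neither quantifier can count brackets, which is why nested subscripts need either extra bookkeeping or a real parser.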

The other solution is to use an AST: extend ast.NodeVisitor and implement the visit_Name and visit_Subscript methods. However, this doesn't work either, because visit_Name is also called for function names.

I would appreciate it if someone could provide a solution (regex or AST) to this problem.

Thanks.

Recommended answer

I found your question an interesting challenge, so here is code that does what you want. A regex alone cannot do this because the expressions can be nested, so this solution combines regexes with string manipulation to handle the nesting:

# -*- coding: utf-8 -*-
import re
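# identifier not followed by `[`, `(` or a quote; also accepts a leading underscore or uppercase letter
RE_IDENTIFIER = r'\b[_a-zA-Z]\w*\b(?!\s*[\[\("\'])'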
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifiers and getitem expressions, in order of appearance."""

    def remove_brackets(text):
        # 1. replace bare `[...]` expressions (list literals) with #{#...#}#
        # so we don't confuse them with subscripts such as word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep replacing until no bare bracket pair is left;
        # substitute on `text` (not the outer `string`) so every pass makes progress
        while re.search(pattern, text):
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list and replace it with a special key holding its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # reserve the slot so the expression comes before its inner identifiers
        # if the expression contains identifiers, extract them too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will hold the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting until no expression is left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # the content inside brackets can itself contain getitem expressions,
        # so after each extraction we handle the remaining brackets again
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert them to integers
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

Output:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

I tested this code on considerably more complicated examples and it worked. Note that the order of extraction is the same as the one you asked for. I hope this is what you needed.
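
For comparison, the AST route mentioned in the question can be sketched as follows. This is not part of the answer above. It is a minimal sketch that assumes Python 3.8+ (for `ast.get_source_segment`) and input that parses as a single expression. The names `VariableCollector` and `extract_variables` are hypothetical. The idea is to override `visit_Call` so that a plain function name is skipped while its arguments are still visited, which avoids the problem of `visit_Name` firing for function names:

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect names and subscript expressions, skipping called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Call(self, node):
        # Skip the callee when it is a plain name (e.g. `func` in `func()`),
        # but still look inside the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Subscript(self, node):
        # Record the whole subscript text, e.g. `bar[1]`.
        self.found.append(ast.get_source_segment(self.source, node))
        # Do not record the base name separately; only descend into it
        # when it is something more complex than a plain name.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Nested variables inside the brackets are collected recursively.
        self.visit(node.slice)

    def visit_Name(self, node):
        self.found.append(node.id)


def extract_variables(source):
    collector = VariableCollector(source)
    collector.visit(ast.parse(source, mode='eval'))
    return collector.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
# ['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
```

Because the parser handles bracket nesting, no placeholder bookkeeping is needed. The trade-off is that the input must be syntactically valid Python.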
