使用Python优化字符串解析 [英] Optimizing string parsing with Python

查看:105
本文介绍了使用Python优化字符串解析的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的字符串格式为'AB(AB(DDC)C)A(BAAC)DAB(ABC)'.

  • 每个字符代表一个元素(ABCD).
  • 在括号之间,右边是每个元素的子元素(可能不存在).
  • Each character represents an element (A, B, C or D).
  • Between parentheses, on the right, there is the child of each element (which may be absent).

例如,具有'AB(AB(DDC)C)A(BAAC)DA'的顶级将是 AB (AB(DDC)C) A (BAAC) DA -> [A, B, A, D, A],相应的子代将是[None, AB(DDC)C, BAAC, None, None].还应递归地解析这些子级.

In example, having 'AB(AB(DDC)C)A(BAAC)DA', the top level would be AB(AB(DDC)C)A(BAAC)DA --> [A, B, A, D, A] and the corresponding children would be [None, AB(DDC)C, BAAC, None, None]. The children are to be parsed as well recursively.

我在这里实现了一个解决方案:

I have implemented a solution here:

def parse_string(string):

    i = 0                                                                       
    parsed = []                                                                 

    while i < len(string):                                                      
        if string[i] in ('A', 'B', 'C', 'D'):                                        
            parsed.append([string[i], None])                                    
            i += 1                                                              
        elif string[i] == '(':                                                  
            open_brakets = 1                                                    
            i += 1                                                              
            j = i                                                               
            while open_brakets:                                                 
                if string[j] == '(':                                            
                    open_brakets += 1                                           
                elif string[j] == ')':                   
                    open_brakets -= 1                    
                j += 1
            # Parse the children as well
            parsed[-1][-1] = parse_string(string[i:j - 1])       
            i = j                                                               
        else:                                                                   
            i += 1                                                              

    return parsed

print parse_string('AB(AB(DDC)C)A(BAAC)DAB(ABC)') 

尽管我认为这有点丑陋,但我确信它不是很有效.

Although I think it's a bit ugly and I'm sure it is not very efficient.

我想知道是否有一种方法可以使Python更简洁,更快捷,更优雅地做到这一点?允许使用外部库(特别是如果它们是用C!:-P编写的).

I wonder if there's a way to make this with Python in a cleaner/faster/more elegant way? Using external libraries is allowed (specially if they're written in C! :-P).

其他应该起作用的字符串示例:

Other examples of strings that should work:

  • ABC(DAB(ACB)BBB(AAA)ABC)DCB

通常,字符串的长度不受限制,子代的数目,长度,嵌套级别的数目也不受限制.

In general, the length of the string is not limited, neither the number of children, nor their length, nor the number of nested levels.

推荐答案

如果您还需要递归地解析内部括号,则:

If you need to recursively parse the inner parentheses as well:

def parse_tree(tree, string, start=0):
    index = start
    while index < len(string):
        current = string[index]
        if current == "(":
            child = tree[-1][1]
            child_parsed = parse_tree(child, string, index+1)
            index += child_parsed + 2 # adds 2 for the parentheses
        elif current == ")":
            break
        else:
            tree.append((current, []))
            index += 1
    return index - start
tree = []
print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))

可以将其视为状态机.状态机接受节点定义,直到看到一个开放的括号,然后在其中将新上下文(即递归函数调用)推送到解析堆栈以解析括号的内容.解析内部上下文时,右括号会弹出上下文.

The way this works can be thought of like a state machine. The state machine accepts node definitions until it sees an open parentheses, in which it pushes a new context (i.e. a recursive function call) to the parsing stack to parse the content of the parentheses. When parsing the inner context, the close parentheses pops the context.

如果您有更复杂的语法,可以更好地扩展的另一种选择是使用解析库,例如PyParsing:

Another alternative, which can scale better if you have more complex grammars is to use a parsing library like PyParsing:

from pyparsing import OneOrMore, Optional, oneOf, alphas, Word, Group, Forward, Suppress, Dict

# define the grammar
nodes = Forward()
nodeName = oneOf(list(alphas))
nodeChildren = Suppress('(') + Group(nodes) + Suppress( ')')
node = Group(nodeName + Optional(nodeChildren))
nodes <<= OneOrMore(node)

print(nodes.parseString('abc(abc(defg)d)de(f)gh'))

像PyParsing这样的解析库使您可以定义易于阅读的声明性语法.

Parsing libraries like PyParsing allows you to define an easy-to-read declarative grammar.

原始非递归解析的答案:一种方法是使用itertools(仅从Python 3.2及更高版本开始累积,itertools文档具有).这样可以避免使用索引:

Answer to original non-recursive parsing: One way to do this is with itertools (accumulate is only from Python 3.2 and up, the itertools docs has a pure python implementation of accumulate for use in older versions). This avoids the use of indices:

from itertools import takewhile, accumulate
PARENS_MAP = {'(': 1, ')': -1}
def parse_tree(tree, string):
    string = iter(string)
    while string:
        current = next(string)
        if current == "(":
            child = iter(string)
            child = ((c, PARENS_MAP.get(c, 0)) for c in child)
            child = accumulate(child, lambda a,b: (b[0], a[1]+b[1]))
            child = takewhile(lambda c: c[1] >= 0, child)
            child = (c[0] for c in child)
            tree[-1][1] = "".join(child)
        else:
            tree.append([current, None])
print(parse_tree('abc(abc(defg)d)de(f)gh'))

我不确定是更快还是更优雅,但是我认为使用显式索引更容易编写,理解和修改.

I'm not quite sure whether it's faster or more elegant, but I think using explicit indexes is much easier to write, understand, and modify.

这篇关于使用Python优化字符串解析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆