使用Python优化字符串解析 [英] Optimizing string parsing with Python
问题描述
我的字符串格式为'AB(AB(DDC)C)A(BAAC)DAB(ABC)'
.
- 每个字符代表一个元素(
A
,B
,C
或D
). - 在括号之间,右边是每个元素的子元素(可能不存在).
- Each character represents an element (
A
,B
,C
orD
). - Between parentheses, on the right, there is the child of each element (which may be absent).
例如,具有'AB(AB(DDC)C)A(BAAC)DA'
的顶级将是 AB (AB(DDC)C) A (BAAC) DA -> [A, B, A, D, A]
,相应的子代将是[None, AB(DDC)C, BAAC, None, None]
.还应递归地解析这些子级.
In example, having 'AB(AB(DDC)C)A(BAAC)DA'
, the top level would be AB(AB(DDC)C)A(BAAC)DA --> [A, B, A, D, A]
and the corresponding children would be [None, AB(DDC)C, BAAC, None, None]
. The children are to be parsed as well recursively.
我在这里实现了一个解决方案:
I have implemented a solution here:
def parse_string(string):
i = 0
parsed = []
while i < len(string):
if string[i] in ('A', 'B', 'C', 'D'):
parsed.append([string[i], None])
i += 1
elif string[i] == '(':
open_brakets = 1
i += 1
j = i
while open_brakets:
if string[j] == '(':
open_brakets += 1
elif string[j] == ')':
open_brakets -= 1
j += 1
# Parse the children as well
parsed[-1][-1] = parse_string(string[i:j - 1])
i = j
else:
i += 1
return parsed
print parse_string('AB(AB(DDC)C)A(BAAC)DAB(ABC)')
尽管我认为这有点丑陋,但我确信它不是很有效.
Although I think it's a bit ugly and I'm sure it is not very efficient.
我想知道是否有一种方法可以使Python更简洁,更快捷,更优雅地做到这一点?允许使用外部库(特别是如果它们是用C!:-P编写的).
I wonder if there's a way to make this with Python in a cleaner/faster/more elegant way? Using external libraries is allowed (specially if they're written in C! :-P).
其他应该起作用的字符串示例:
Other examples of strings that should work:
-
ABC(DAB(ACB)BBB(AAA)ABC)DCB
通常,字符串的长度不受限制,子代的数目,长度,嵌套级别的数目也不受限制.
In general, the length of the string is not limited, neither the number of children, nor their length, nor the number of nested levels.
推荐答案
如果您还需要递归地解析内部括号,则:
If you need to recursively parse the inner parentheses as well:
def parse_tree(tree, string, start=0):
index = start
while index < len(string):
current = string[index]
if current == "(":
child = tree[-1][1]
child_parsed = parse_tree(child, string, index+1)
index += child_parsed + 2 # adds 2 for the parentheses
elif current == ")":
break
else:
tree.append((current, []))
index += 1
return index - start
tree = []
print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))
可以将其视为状态机.状态机接受节点定义,直到看到一个开放的括号,然后在其中将新上下文(即递归函数调用)推送到解析堆栈以解析括号的内容.解析内部上下文时,右括号会弹出上下文.
The way this works can be thought of like a state machine. The state machine accepts node definitions until it sees an open parentheses, in which it pushes a new context (i.e. a recursive function call) to the parsing stack to parse the content of the parentheses. When parsing the inner context, the close parentheses pops the context.
如果您有更复杂的语法,可以更好地扩展的另一种选择是使用解析库,例如PyParsing:
Another alternative, which can scale better if you have more complex grammars is to use a parsing library like PyParsing:
from pyparsing import OneOrMore, Optional, oneOf, alphas, Word, Group, Forward, Suppress, Dict
# define the grammar
nodes = Forward()
nodeName = oneOf(list(alphas))
nodeChildren = Suppress('(') + Group(nodes) + Suppress( ')')
node = Group(nodeName + Optional(nodeChildren))
nodes <<= OneOrMore(node)
print(nodes.parseString('abc(abc(defg)d)de(f)gh'))
像PyParsing这样的解析库使您可以定义易于阅读的声明性语法.
Parsing libraries like PyParsing allows you to define an easy-to-read declarative grammar.
原始非递归解析的答案:一种方法是使用itertools(仅从Python 3.2及更高版本开始累积,itertools文档具有).这样可以避免使用索引:
Answer to original non-recursive parsing: One way to do this is with itertools (accumulate is only from Python 3.2 and up, the itertools docs has a pure python implementation of accumulate for use in older versions). This avoids the use of indices:
from itertools import takewhile, accumulate
PARENS_MAP = {'(': 1, ')': -1}
def parse_tree(tree, string):
string = iter(string)
while string:
current = next(string)
if current == "(":
child = iter(string)
child = ((c, PARENS_MAP.get(c, 0)) for c in child)
child = accumulate(child, lambda a,b: (b[0], a[1]+b[1]))
child = takewhile(lambda c: c[1] >= 0, child)
child = (c[0] for c in child)
tree[-1][1] = "".join(child)
else:
tree.append([current, None])
print(parse_tree('abc(abc(defg)d)de(f)gh'))
我不确定是更快还是更优雅,但是我认为使用显式索引更容易编写,理解和修改.
I'm not quite sure whether it's faster or more elegant, but I think using explicit indexes is much easier to write, understand, and modify.
这篇关于使用Python优化字符串解析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!