Incremental but complete parsing with PyParsing?


Question

I'm using PyParsing to parse some rather large text files with a C-like format (braces and semicolons and all that).

PyParsing works just great, but it is slow and consumes a very large amount of memory due to the size of my files.

Because of this, I wanted to try to implement an incremental parsing approach wherein I'd parse the top-level elements of the source file one-by-one. The scanString method of pyparsing seems like the obvious way to do this. However, I want to make sure that there is no invalid/unparseable text in-between the sections parsed by scanString, and can't figure out a good way to do this.

Here's a simplified example that shows the problem I'm having:

sample="""f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""

from pyparsing import *

COMMENT = Suppress('#' + restOfLine)
SEMI, COMMA, LPAREN, RPAREN = map(Suppress, ';,()')

ident = Word(alphas, alphanums+"_")
integer = Word(nums+"+-",nums)

statement = ident("fn") + LPAREN + Group(Optional(delimitedList(integer)))("arguments") + RPAREN + SEMI

p = statement.ignore(COMMENT)

for res, start, end in p.scanString(sample):
    print("***** (%d,%d)" % (start, end))
    print(res.dump())

Output:

***** (0,10)
['f1', ['1', '2', '3']]
- arguments: ['1', '2', '3']
- fn: f1
***** (11,25)
['f2_no_args', []]
- arguments: []
- fn: f2_no_args
***** (53,62)
['bar', ['7', '8']]
- arguments: ['7', '8']
- fn: bar
***** (88,98)
['baz', ['9', '10']]
- arguments: ['9', '10']
- fn: baz

The ranges returned by scanString have gaps due to unparsed text between them ((0,10),(11,25),(53,62),(88,98)). Two of these gaps are whitespace or comments, which should not trigger an error, but one of them (this should be an error;) contains unparsable text, which I want to catch.
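One way to make the gap requirement concrete: collect the (start, end) spans, then check that every gap between consecutive spans contains only whitespace or comments. The sketch below is a stdlib-only analogue (a hand-rolled regex stands in for the pyparsing grammar; `statement` and `find_bad_gaps` are illustrative names, not pyparsing API). On the sample above it produces the same (0,10), (11,25), (53,62), (88,98) spans and flags only the one genuinely bad gap.

```python
import re

sample = """f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""

# Regex stand-in for the `statement` grammar: ident(args);
statement = re.compile(r"\w+\(\s*[\d,+\-\s]*\)\s*;")

def find_bad_gaps(text):
    """Yield (start, end) spans between matches that hold real unparsed text."""
    # Blank out comments first (stand-in for .ignore(COMMENT)); same-length
    # replacement keeps all offsets valid for the original text.
    stripped = re.sub(r"#[^\n]*", lambda m: " " * len(m.group()), text)
    pos = 0
    for m in statement.finditer(stripped):
        if stripped[pos:m.start()].strip():   # gap is not pure whitespace/comment
            yield pos, m.start()
        pos = m.end()
    if stripped[pos:].strip():                # trailing unparsed text
        yield pos, len(text)

bad = list(find_bad_gaps(sample))
```

Here `bad` contains a single span, (62, 88), covering the `this should be an error;` line, while the whitespace and comment gaps pass silently.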

Is there a way to parse a file incrementally with pyparsing while still ensuring that the entire input can be parsed with the specified parser grammar?

Answer

After a brief discussion, I modified the ParserElement.parseString method slightly to come up with parseConsumeString, which does about what I want. This version repeatedly calls ParserElement._parse followed by ParserElement.preParse.

Here is code to monkey-patch ParserElement with the parseConsumeString method:

from pyparsing import ParseBaseException, ParserElement

def parseConsumeString(self, instring, parseAll=True, yieldLoc=False):
    '''Generator version of parseString which does not try to parse
    the whole string at once.

    Should be called with a top-level parser that could parse the
    entire string if called repeatedly on the remaining pieces.
    Instead of:

        ZeroOrMore(TopLevel).parseString(s ...)

    Use:

        TopLevel.parseConsumeString(s ...)

    If yieldLoc==True, it will yield a tuple of (tokens, startloc, endloc).
    If False, it will yield only tokens (like parseString).

    If parseAll==True, it will raise an error as soon as a parse
    error is encountered. If False, it will return as soon as a parse
    error is encountered (possibly before yielding any tokens).'''

    if not self.streamlined:
        self.streamline()
        #~ self.saveAsList = True
    for e in self.ignoreExprs:
        e.streamline()
    if not self.keepTabs:
        instring = instring.expandtabs()
    try:
        sloc = loc = 0
        while loc<len(instring):
            # keeping the cache (if in use) across loop iterations wastes memory (can't backtrack outside of loop)
            ParserElement.resetCache()
            loc, tokens = self._parse(instring, loc)
            if yieldLoc:
                yield tokens, sloc, loc
            else:
                yield tokens
            sloc = loc = self.preParse(instring, loc)
    except ParseBaseException as exc:
        if not parseAll:
            return
        elif ParserElement.verbose_stacktrace:
            raise
        else:
            # catch and re-raise exception from here, clears out pyparsing internal stack trace
            raise exc

def monkey_patch():
    ParserElement.parseConsumeString = parseConsumeString
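The consume loop above can be mimicked without pyparsing, which may make its control flow easier to follow in isolation. In the toy below, all names are illustrative: a regex stands in for the grammar, `ws.match` plays the role of preParse, a failed match plays the role of a ParseBaseException, and each yield is a (tokens, start, end) tuple as with yieldLoc=True.

```python
import re

def make_consume_parser(pattern):
    """Toy analogue of parseConsumeString: match `pattern` at the current
    location, yield (tokens, start, end), skip whitespace, repeat."""
    rx = re.compile(pattern)
    ws = re.compile(r"\s*")

    def parse_consume(instring, parse_all=True):
        loc = ws.match(instring).end()           # skip leading whitespace
        while loc < len(instring):
            m = rx.match(instring, loc)          # like self._parse(instring, loc)
            if m is None:
                if not parse_all:
                    return                       # parseAll=False: stop silently
                raise ValueError("parse error at offset %d" % loc)
            yield m.group(), m.start(), m.end()  # like yieldLoc=True
            loc = ws.match(instring, m.end()).end()  # preParse before next piece
    return parse_consume

parse = make_consume_parser(r"\w+\([\d,\s]*\)\s*;")
pieces = list(parse("f1(1); f2(2,3);"))
```

With parse_all=True (the default), an unparseable region such as `oops;` raises at the first bad offset rather than being skipped, which is exactly the behaviour scanString lacks.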

Notice that I also moved the call to ParserElement.resetCache into each loop iteration. Because it's impossible to backtrack out of the loop, there's no need to retain the cache across iterations. This drastically reduces memory consumption when using PyParsing's packrat caching feature. In my tests with a 10 MiB input file, peak memory consumption went down from ~6 GB to ~100 MB, while the parser ran about 15-20% faster.
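The same "clear the memo once nothing can backtrack into it" idea can be sketched with functools.lru_cache standing in for the packrat cache (a hypothetical analogy, not pyparsing's real implementation): clearing after each top-level piece keeps the cache bounded by one piece instead of growing with the whole input.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_expr(fragment):
    # Hypothetical memoized "parse" of one fragment; the memo plays the
    # role of the packrat cache.
    return fragment.count(";")

def parse_pieces(pieces):
    results = []
    for piece in pieces:
        results.append(parse_expr(piece))
        # Like ParserElement.resetCache() inside the loop: once a top-level
        # piece is consumed, nothing can backtrack into it, so its memo
        # entries are dead weight. Clearing bounds the cache to one piece.
        parse_expr.cache_clear()
    return results

totals = parse_pieces(["a;b;", "c;"])
```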
