Editing pyparsing parse results


Question


This is similar to a question I've asked before.

I have written a pyparsing grammar logparser for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and after that function A finishes and gives us its return value. Due to this, the log file is very difficult to read by hand because the call information and return value information of one function can be thousands of lines apart.

My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:

[0]:                            # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]:                            # second log
    - input_blocks:
    ...

    - output_blocks:
    ...
...                             # n-th log

I want to solve the problem that input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:

[0]:                            # first log
  - function_blocks:
    [0]:
        - input_block:
            - func_name: 'Foo'
            - parameters: ...
            - thread: '123'
            - timestamp_in: '12:01'
        - output_block:
            - func_name: 'Foo'
            - func_time: '3'
            - parameters: ...
            - thread: '123'
            - timestamp_out: '12:04'
    [1]:
        - input_block:
            - func_name: 'Bar'
            - parameters: ...
            - thread: '456'
            - timestamp_in: '12:02'
        - output_block:
            - func_name: 'Bar'
            - func_time: '1'
            - parameters: ...
            - thread: '456'
            - timestamp_out: '12:03'
[1]:                            # second log
    - function_blocks:
    [0]: ...
    [1]: ...
...                             # n-th log

To achieve this, I define a function rearrange which iterates through the input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as a parse action for the log grammar: logparser.setParseAction(rearrange)

def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                ...
    return log_token
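
For reference, check_timestamp is not shown here; a minimal sketch consistent with the sample data above (assuming 'HH:MM' timestamps and func_time given in minutes — adjust the formats to the real log) could look like this:

```python
from datetime import datetime, timedelta

def check_timestamp(timestamp_out, func_time, timestamp_in):
    # Parse the 'HH:MM' timestamps and check that the call time plus
    # the function's runtime (in minutes) equals the completion time.
    t_in = datetime.strptime(timestamp_in, '%H:%M')
    t_out = datetime.strptime(timestamp_out, '%H:%M')
    return t_in + timedelta(minutes=int(func_time)) == t_out
```

With the sample log above, this matches Foo (12:01 + 3 min = 12:04) and Bar (12:02 + 1 min = 12:03).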

My question is: How do I put the matching output_block and input_block into a function_block in such a way that I can still enjoy the convenient access methods of pyparsing.ParseResults?
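
By the convenient access methods I mean list-, dict-, and attribute-style access on the blocks, as in this small standalone example (the grammar here is illustrative, not my real one):

```python
from pyparsing import Group, Word, alphas, nums

# a toy block with two named fields, collected under a listAllMatches name
entry = Group(Word(alphas)('func_name') + Word(nums)('thread'))('blocks*')
result = entry.parseString('Foo 123')
block = result.blocks[0]

print(block[0])          # list-style      -> 'Foo'
print(block['thread'])   # dict-style      -> '123'
print(block.func_name)   # attribute-style -> 'Foo'
```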

My idea looks like this:

def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')

    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token

This doesn't work though. The addition causes a maximum recursion error, and .pop() doesn't work as expected: it doesn't pop the whole block, only the last entry in that block. It also doesn't actually delete that entry; it just removes it from the list, but it is still accessible by its results name.
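
A small hypothetical example demonstrates the pop behavior: popping returns only the last token of the block, and the results name still resolves afterwards:

```python
from pyparsing import Group, Word, alphas

block = Group(Word(alphas)('func_name'))('input_blocks*')
result = (block + block).parseString('Foo Bar')

first = result.input_blocks[0]
popped = first.pop()    # pops only the last token of the block, not the block itself
print(popped)           # -> 'Foo'
print(len(first))       # -> 0, the token list is now empty...
print(first.func_name)  # -> 'Foo', ...but the results name still resolves
```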

It's also possible that some of the input_blocks don't have a corresponding output_block (for example, if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks, output_blocks (for the spare blocks), and function_blocks (for the matching blocks).

Thanks for your help!

EDIT:

I made a simpler example to show my problem. I also experimented and came up with a solution which kind of works but is a bit messy. I must admit there was a lot of trial and error involved, because I neither found documentation on the inner workings of ParseResults nor could I figure out how to properly create my own nested ParseResults structure.

from pyparsing import *

def main():
    log_data = '''\
    Func1_in
    Func2_in
    Func2_out
    Func1_out
    Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') + '_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())

    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data

                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token


if __name__ == '__main__':
    main()

Output:

***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []]   # why is this duplicated? I just want the inner function_blocks!
  - function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
    [0]:
      [['Func1'], ['Func1']]
      - input: ['Func1']
        - func_name: 'Func1'
      - output: ['Func1']
        - func_name: 'Func1'
    [1]:
      [['Func2'], ['Func2']]
      - input: ['Func2']
        - func_name: 'Func2'
      - output: ['Func2']
        - func_name: 'Func2'
    [2]:                              # where does this come from?
      [[], []]
      - input: []
      - output: []
- input_blocks: [[], [], ['Func3']]
  [0]:                                # how do I delete these indexes?
    []                                #  I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []

Solution

This version of rearrange addresses most of the issues I see in your example:

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):

            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them from their original positions in log_token

                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])

                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # asList=True nests the toklist as a sub-list (like Group); modal=False gives listAllMatches-style accumulation

                del function_block['input'], function_block['output']  # remove duplicate name references

                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep looking for a matching output block
                break

    # find all input blocks that weren't cleared (those with no matching output block) and append them as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # asList=True nests the toklist as a sub-list (like Group); modal=False gives listAllMatches-style accumulation
        del function_block['input']  # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)

    return log_token

And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:

    # trailing '*' on the results name is equivalent to listAllMatches=True
    input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
    output_block = Group(Word(alphanums)('func_name') + '_out')('output_blocks*')
    log = OneOrMore(input_block | output_block)
    log.addParseAction(rearrange)
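
As a quick check of that shorthand, repeated matches accumulate under the starred results name just as with listAllMatches=True (shown here without the parse action, using the sample function names from the question):

```python
from pyparsing import Group, OneOrMore, ParserElement, Suppress, Word, alphanums

ParserElement.inlineLiteralsUsing(Suppress)
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
log = OneOrMore(input_block)

result = log.parseString('Func1_in Func2_in')
print([b.func_name for b in result.input_blocks])  # -> ['Func1', 'Func2']
```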

Since rearrange updates log_token in place, the ending return statement becomes unnecessary once you make it a parse action.

It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.

Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs, and I don't really see a good home for it.
