Lazy parse a stateful, multiline per record data stream in Python?


Problem Description

Here's what a file looks like:

BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    data I
    wish to
    extract
 END_DB

I'd like to be able to parse an infinite stream of them all cat'd together, which precludes doing something like re.findall('something useful', '\n'.join(sys.stdin), re.M).

Below is my attempt, but I have to force the generator returned from get_raw_table(), so it doesn't quite fit the requirements. Removing the force means you can't test whether the returned generator is empty, so you cannot tell whether sys.stdin is exhausted.

import sys

def get_raw_table(it):
    state = 'begin'
    for line in it:
        if line.startswith('BEGIN_DB'):
            state = 'discard'
        elif line.startswith('END_DB'):
            return
        elif state == 'discard' and not line.strip():  # blank line ends the header
            state = 'take'
        elif state == 'take' and line:
            yield line.strip().strip('#').split()

# raw_tables is a list (per file) of lists (per row) of lists (per column)
raw_tables = []
while True:
    result = list(get_raw_table(sys.stdin))
    if result:
        raw_tables.append(result)
    else:
        break
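The sticking point above is testing whether a generator is empty without forcing it entirely. One lazy alternative (a sketch of my own, not from the original post) is to peek a single item and chain it back in front of the iterator:

```python
import itertools

def peek(gen):
    # Hypothetical helper (not in the original post): report whether `gen`
    # yields at least one item, without consuming the rest of it.
    try:
        first = next(gen)
    except StopIteration:
        return False, iter(())  # exhausted: hand back an empty iterator
    return True, itertools.chain([first], gen)  # push the item back in front

ok, rows = peek(iter([['a', 'b'], ['c']]))
# ok is True, and rows still lazily yields both rows
```

With something like this, the driver loop could call peek(get_raw_table(sys.stdin)) instead of list(...), keeping the extraction lazy.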

Recommended Answer

Something like this might work:

import itertools

def chunks(it):
    while True:
        # skip ahead to the next BEGIN_DB line
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # drop the header lines, stopping at the blank separator
        it = itertools.dropwhile(lambda x: x.strip(), it)
        try:
            next(it)  # consume the blank line
        except StopIteration:
            return    # stream exhausted (PEP 479: don't leak StopIteration)
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)

For example:

src = """
BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    1data I
    1wish to
    1extract
 END_DB


BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    2data I
    2wish to
    2extract
 END_DB
"""


src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
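To produce the per-row, per-column lists the question asked for, the same chunks approach can be consumed lazily. One caveat worth making explicit (this adaptation is mine, not part of the original answer): every chunk shares the one underlying iterator, so each chunk must be fully consumed before the next one is requested:

```python
import itertools

def chunks(it):
    # Same structure as the answer above, with a guard for an exhausted stream.
    while True:
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        it = itertools.dropwhile(lambda x: x.strip(), it)
        try:
            next(it)  # consume the blank separator line
        except StopIteration:
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)

src = iter("BEGIN_DB\nheader\n\n1data I\n1wish to\nEND_DB\n".splitlines())
# each inner comprehension drains its chunk before the outer loop advances
tables = [[line.split() for line in chunk] for chunk in chunks(src)]
# tables == [[['1data', 'I'], ['1wish', 'to']]]
```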
