Lazy parse a stateful, multiline per record data stream in Python?
Question
Here is what one file looks like:
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

data I
wish to
extract
END_DB
I'd like to be able to parse an infinite stream of them all cat'd together, which precludes doing something like re.findall('something useful', '\n'.join(sys.stdin), re.M).
Below is my attempt, but I have to force the generator returned from get_raw_table(), so it doesn't quite fit the requirements. Removing the force means you can't test whether the returned generator is empty, so you cannot tell whether sys.stdin is exhausted.
import sys

def get_raw_table(it):
    state = 'begin'
    for line in it:
        if line.startswith('BEGIN_DB'):
            state = 'discard'
        elif line.startswith('END_DB'):
            return
        elif state == 'discard' and not line.strip():
            # the blank line after the header marks the start of the data
            state = 'take'
        elif state == 'take' and line:
            yield line.strip().strip('#').split()

# raw_tables is a list (per file) of lists (per row) of lists (per column)
raw_tables = []
while True:
    # list() forces the generator; an empty result is taken to mean
    # that sys.stdin is exhausted
    result = list(get_raw_table(sys.stdin))
    if result:
        raw_tables.append(result)
    else:
        break
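The emptiness problem described above can be illustrated, and worked around, by pulling a single item from the generator and chaining it back on. This is only a sketch: `peek` and `_EMPTY` are hypothetical names that do not appear in the original code, and the `rows` generator merely stands in for get_raw_table().

```python
import itertools

_EMPTY = object()  # hypothetical sentinel meaning "nothing was produced"

def peek(gen):
    """Return (is_empty, iterator) after consuming at most one item from gen."""
    first = next(gen, _EMPTY)
    if first is _EMPTY:
        return True, iter(())
    # Put the consumed item back in front of the untouched remainder.
    return False, itertools.chain([first], gen)

def rows():
    # Stand-in for get_raw_table(): yields one list of columns per row.
    yield ['data', 'I']
    yield ['wish', 'to']

is_empty, table = peek(rows())
print(is_empty)     # False
print(list(table))  # [['data', 'I'], ['wish', 'to']]
```

Only the first row is read eagerly here, so the rest of the table stays lazy.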
Recommended answer
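Something like this might work:

import itertools

def chunks(it):
    it = iter(it)
    while True:
        # skip everything before the BEGIN_DB line
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # skip BEGIN_DB and the header, i.e. every non-blank line
        it = itertools.dropwhile(lambda x: x.strip(), it)
        try:
            next(it)  # consume the blank line that ends the header
        except StopIteration:
            return  # input exhausted (needed on Python 3.7+, see PEP 479)
        # lazily yield the data lines up to END_DB
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)

For example:

src = """
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

1data I
1wish to
1extract
END_DB
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

2data I
2wish to
2extract
END_DB
"""
src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()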
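The same idea can also be written as a plain generator, which reads from a single shared iterator instead of wrapping it in new dropwhile objects for every record. The following is a minimal sketch under assumptions, not part of the original answer: `iter_tables`, `rows`, and the shortened `sample` input are hypothetical names, and a blank line is assumed to separate each header from its data, as in the question's code.

```python
import itertools

def iter_tables(lines):
    """Yield one lazy row generator per BEGIN_DB ... END_DB record."""
    lines = iter(lines)

    def rows():
        # Yield split data rows until END_DB, reading from the shared iterator.
        for line in lines:
            if 'END_DB' in line:
                return
            yield line.strip().strip('#').split()

    for line in lines:
        if 'BEGIN_DB' not in line:
            continue              # outside a record: discard the line
        for header in lines:      # skip the header up to the blank line
            if not header.strip():
                break
        else:
            return                # stream ended inside a header
        table = rows()
        yield table
        for _ in table:           # drain rows the caller left unread
            pass

sample = """\
BEGIN_META
stuff
END_META
BEGIN_DB
header

1data I
1wish to
END_DB
BEGIN_DB
header

2data I
END_DB
"""

tables = [list(t) for t in iter_tables(sample.splitlines())]
print(tables)  # [[['1data', 'I'], ['1wish', 'to']], [['2data', 'I']]]

# An endless stream is fine as long as only a finite prefix is requested.
endless = itertools.cycle(sample.splitlines())
first_three = [list(t) for t in itertools.islice(iter_tables(endless), 3)]
print(len(first_three))  # 3
```

Each table must be consumed before the next one is requested, because all tables read from the same underlying stream.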