Parsing data from text file
Question
I have a text file with content like this:
******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
This is followed by an empty line and then more blocks like it, all with the same data fields.
I am porting some C++ code to Python. One part of it reads the file line by line, detects the header text, and then matches each field label to extract the data. This doesn't look like smart code at all, and I think Python must have a library that parses data like this easily. After all, it almost looks like CSV!
Any ideas on this?
Recommended answer
It is very far from CSV, actually.
You can use the file as an iterator; the following generator function yields complete sections:
def load_sections(filename):
    with open(filename, 'r') as infile:
        while True:
            # Skip lines until the next '****' header.
            line = ''
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file ends the generator

            # Collect 'key: value' lines until a blank line or end of file.
            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break
                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry
This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: first skip lines until a **** header line is found (the header itself is then discarded), then loop over all non-empty lines to build the section.
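To see why mixing next() and a for loop on the same object works, here is a minimal sketch. It assumes an in-memory io.StringIO as a stand-in for the real file, and the sample lines are made up for illustration:

```python
import io

# Hypothetical in-memory stand-in for the real text file.
sample = io.StringIO(
    "junk line\n"
    "******** ENTRY 01 ********\n"
    "ID: 01\n"
    "Data1: 0.5\n"
    "\n"
    "******** ENTRY 02 ********\n"
)

# next() consumes lines one at a time until the header is reached...
line = next(sample)
while not line.startswith('****'):
    line = next(sample)
header = line.strip()

# ...and a for loop on the same object resumes right after the header.
fields = []
for line in sample:
    line = line.strip()
    if not line:
        break  # a blank line ends the section
    fields.append(line)

# Whatever has not been consumed yet is still waiting in the iterator.
remaining = next(sample).strip()
```

Both loops share a single read position, so no line is processed twice and none is skipped between them.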
Use the function in a loop:
for section in load_sections(filename):
    print(section)
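Repeating your sample data in a text file results in the following (the output shown is from Python 2; on Python 3.7+ the keys print in insertion order, starting with 'ID'):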
>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
...
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
You can add data converters to that if you want to; a mapping of key to callable would do:
converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}
Then, in the generator function, replace entry[key] = value with entry[key] = converters.get(key, lambda v: v)(value). Keys that have no converter fall back to the identity function, so their values stay as strings.
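Put together, the converter variant might look like the sketch below. The name load_typed_sections and the choice to accept any iterable of lines (rather than a filename) are assumptions made here so the example runs without a file on disk; the parsing logic follows the generator above.

```python
import io

CONVERTERS = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}

def load_typed_sections(lines, converters=CONVERTERS):
    """Yield one dict per '****' section, converting values by key."""
    lines = iter(lines)
    while True:
        # Skip ahead to the next header; stop cleanly at end of input.
        line = ''
        while not line.startswith('****'):
            line = next(lines, None)
            if line is None:
                return
        entry = {}
        for line in lines:
            line = line.strip()
            if not line:
                break  # a blank line ends the section
            key, value = map(str.strip, line.split(':', 1))
            # Keys without a converter keep their string value.
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry

# Sample block from the question, repeated twice in memory.
block = (
    "******** ENTRY 01 ********\n"
    "ID: 01\n"
    "Data1: 0.1834869385E-002\n"
    "Data2: 10.9598489301\n"
    "Data3: -0.1091356549E+001\n"
    "Data4: 715\n"
    "\n"
)
sections = list(load_typed_sections(io.StringIO(block * 2)))
```

float() accepts the zero-padded exponents used in the file (such as E-002), and int() accepts the leading zero in '01', so no preprocessing of the values is needed.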