Parsing data from text file


Question

I have a text file with content like this:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

Then comes an empty line, followed by more blocks like this one, all with the same data fields.

I am porting some C++ code to Python. One part of it reads the file line by line, detects the header text, and then matches each field label to extract the data. That does not look like smart code at all, and I think Python must have some library that parses data like this easily. After all, it almost looks like CSV!

Any ideas on this?

Recommended answer

It is very far from CSV, actually.

You can use the file as an iterator; the following generator function yields one complete section at a time:

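def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # Skip ahead to the next '****' header line.
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    # End of file: finish the generator. Letting next() raise
                    # StopIteration here becomes a RuntimeError on Python 3.7+.
                    return

            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line closes the section

                # Split on the first colon only; strip padding from both parts.
                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry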

This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: first skip lines until a **** header line is found (anything before it is discarded), then loop over all non-empty lines to build a section.
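The shared-position behaviour described above can be seen in a small sketch. It uses io.StringIO as a stand-in for an open file, an assumption made so the example runs without a file on disk; the sample text and variable names are illustrative only.

```python
import io

# io.StringIO stands in for an open file here; both iterate line by line.
sample = io.StringIO("header\nfirst\nsecond\n\nafter blank\n")

skipped = next(sample)          # consumes the "header" line

collected = []
for line in sample:             # continues from "first", not from the top
    line = line.strip()
    if not line:
        break                   # stop at the blank line
    collected.append(line)

# The for loop left the position just past the blank line.
remaining = next(sample, None)
```

Because next() and the for loop pull from the same iterator, neither ever sees a line the other has already consumed, which is what lets the generator alternate between skipping to a header and reading a section.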

Use the function in a loop:

for section in load_sections(filename):
    print(section)

Repeating your sample data in a text file results in:

>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
... 
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}

You can add some data converters to that if you want to; a mapping of key to callable would do:

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

Then, in the generator function, replace entry[key] = value with entry[key] = converters.get(key, lambda v: v)(value). Keys without a converter fall back to the identity lambda and stay strings.
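As a rough sketch of how the converter lookup fits into the parsing loop, the variant below takes any iterable of lines instead of a filename so it can run on an in-memory sample. The name parse_sections and its converters parameter are hypothetical, introduced only for this illustration; the data is the sample block from the question, repeated twice.

```python
import io

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

def parse_sections(lines, converters):
    """Yield one dict per '****' section, converting values by key."""
    lines = iter(lines)
    line = ''
    while True:
        # Skip ahead to the next header; stop cleanly at end of input.
        while not line.startswith('****'):
            line = next(lines, None)
            if line is None:
                return
        entry = {}
        for line in lines:
            line = line.strip()
            if not line:
                break
            key, value = map(str.strip, line.split(':', 1))
            # Unknown keys fall back to the identity function.
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry

block = (
    "******** ENTRY 01 ********\n"
    "ID:                  01\n"
    "Data1:               0.1834869385E-002\n"
    "Data2:              10.9598489301\n"
    "Data3:              -0.1091356549E+001\n"
    "Data4:                715\n"
    "\n"
)
sections = list(parse_sections(io.StringIO(block * 2), converters))
```

With this input, sections holds two dictionaries whose ID and Data4 values are ints and whose Data1 to Data3 values are floats; int() accepts the zero-padded '01' and float() accepts the E-002 style exponents.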
