Python: parsing structured text to CSV format


Question

I want to convert plain structured text files to the CSV format using Python.

The input looks like this:

[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[...]
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
[...]

The output is supposed to look like this:

Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]

That is, the input consists of structured text records delimited by [----<sequence number>----] and containing <key>: <values> pairs, and the output should be CSV with one record per line.

I am able to retrieve the <key>: <values> pairs for conversion to CSV via:

import re

colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')

However, I have trouble recognizing the beginning and end of each structured text record and rewriting the records as CSV lines. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.

Recommended answer

Reading the records is not hard:

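def read_records(iterable):
    """Yield one dictionary per record from an iterable of text lines."""
    record = {}
    for line in iterable:
        line = line.strip()
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # skip blank lines and anything that is not a key: value pair
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()

    # file done, yield last record
    if record:
        yield record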

This produces dictionaries from your input file.
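
As a quick check of what the generator yields, and as one possible way to tell Version: 2 records from Version: 3 records, the sketch below feeds it a small in-memory sample and groups the resulting dictionaries by their Version value. The sample text, the compact copy of read_records, and the group_by_version helper are illustrative assumptions, not part of the original answer.

```python
import io
from collections import defaultdict

# In-memory sample modelled on the input shown in the question.
SAMPLE = """\
[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
"""


def read_records(iterable):
    # Compact copy of the generator above, with blank and colon-less
    # lines skipped, so this snippet runs on its own.
    record = {}
    for line in iterable:
        line = line.strip()
        if line.startswith('[------'):
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()
    if record:
        yield record


def group_by_version(records):
    # Hypothetical helper: bucket records by their 'Version' value.
    groups = defaultdict(list)
    for record in records:
        groups[record.get('Version')].append(record)
    return dict(groups)


records = list(read_records(io.StringIO(SAMPLE)))
groups = group_by_version(records)
print(records[0])      # {'Version': '2', 'Stream': '5', 'Account': 'A'}
print(sorted(groups))  # ['2', '3']
```

Note that all values are strings, so the groups are keyed by '2' and '3' rather than by integers. Each group could then be written to its own CSV file with its own header list.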

From this you can produce CSV output using the csv module, specifically the csv.DictWriter() class:

import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline='' (on Python 2, use 'wb')
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile:
    records = read_records(infile)

    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()

    # write all records, one CSV row each
    writer.writerows(records)

Any header keys missing from a record leave that column empty for that record. Any extra keys in a record that are not listed in headers raise a ValueError; either add them to the headers tuple, or pass extrasaction='ignore' to the DictWriter() constructor.
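
The handling of missing and extra keys can be checked with a small in-memory sketch. The sample records and the 'Comment' key are made up for illustration; restval and extrasaction are documented csv.DictWriter parameters.

```python
import csv
import io

headers = ('Version', 'Stream', 'Account')

# Made-up records: the first lacks 'Account', the second has an extra key.
records = [
    {'Version': '2', 'Stream': '5'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Comment': 'extra'},
]

# Default extrasaction='raise': the extra 'Comment' key raises ValueError.
error = None
strict = csv.DictWriter(io.StringIO(), headers, delimiter=';')
try:
    strict.writerows(records)
except ValueError as exc:
    error = str(exc)

# extrasaction='ignore' drops unknown keys; restval fills missing ones.
out = io.StringIO()
lenient = csv.DictWriter(out, headers, delimiter=';',
                         extrasaction='ignore', restval='')
lenient.writeheader()
lenient.writerows(records)
print(out.getvalue())
```

With these settings the first record is written with an empty Account column and the Comment value is silently dropped, so 'ignore' is best reserved for keys you are sure you do not need.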
