Python: Parsing Multiple .txt Files into a Single .csv File?

Question

I'm not very experienced with complicated, large-scale parsing in Python. Do you have any tips or guides on how to easily parse multiple text files with different formats, combine them into a single .csv file, and ultimately load them into a database?

Examples of the text files are as follows:

general.txt (Name -- Department (DEPT) Room # [Age])

John Doe -- Management (MANG) 205 [Age: 40]
Equipment: Laptop, Desktop, Printer, Stapler
Experience: Python, Java, HTML
Description: Hardworking, awesome

Mary Smith -- Public Relations (PR) 605 [Age: 24] 
Equipment: Mac, PC
Experience: Social Skills
Description: fun to be around

Scott Lee -- Programmer (PG) 403 [Age: 25]
Equipment: Personal Computer
Experience: HTML, CSS, JS
Description: super-hacker

Susan Kim -- Programmer (PG) 504 [Age: 21]
Equipment: Desktop
Experience: Social Skills
Descriptions: fun to be around

Bob Simon  -- Programmer (PG) 101 [Age: 29]
Equipment: Pure Brain Power
Experience: C++, C, Java 
Description: never comes out of his room

cars.txt (a list of people who own cars by their department/room #)

Programmer: PG 403, PG 101
Management: MANG 205

house.txt

Programmer: PG 504

The final csv should preferably tabulate to something like:

Name       | Division    | Division Abbreviation | Equipment | Room | Age | Car? | House?
Scott Lee  | Programming | PG                    | PC        | 403  | 25  | YES  | NO
Mary Smith | Public Rel. | PR                    | Mac, PC   | 605  | 24  | NO   | NO

The ultimate goal is to have a database where searching "PR" would return every row where a person's department is "PR", and so on. There are maybe 30 text files in total, each representing one or more columns in the database. Some columns are short paragraphs that include commas. There are around 10,000 rows in total. I know Python has a built-in csv module, but I'm not sure where to start or how to end up with just one csv. Any help?

Recommended Answer

It looks like you're looking for someone who will solve the whole problem for you. Here I am :)

The general idea is to parse the general info into a dict (using regular expressions), then append the additional fields to it, and finally write it to CSV. Here's a Python 3.x solution (I think Python 2.7+ should suffice):

import csv
import re


def read_general(fname):
    # Read general info to dict with 'PR 123'-like keys

    # Regexp that will split a name row into a ready-to-use dict
    re_name = re.compile(r'''
        (?P<Name>.+)
        \ --\  # Separator + space
        (?P<Division>.+)
        \  # Space
        \(
            (?P<Division_Abbreviation>.*)
        \)
        \  # Space
        (?P<Id>\d+)
        \  # Space
        \[Age:\  # Space at the end
            (?P<Age>\d+)
        \]
        ''', re.X)

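    general = {}
    man = None  # Record currently being filled; set by the first name line

    with open(fname, 'rt') as f:
        for line in f:
            line = line.strip()
            m = re_name.match(line)

            if m:
                # Name line: start a new record, stripping stray whitespace
                man = {k: v.strip() for k, v in m.groupdict().items()}
                key = '%s %s' % (man['Division_Abbreviation'], man['Id'])
                general[key] = man

            elif line and man is not None and ': ' in line:
                # Detail line such as 'Equipment: Laptop, Desktop'
                key, value = line.split(': ', 1)
                man[key] = value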

    return general


def add_bool_criteria(fname, field, general):
    # Append a field with YES/NO value

    with open(fname, 'rt') as f:
        yes_keys = set()

        # Phase one, gather all keys
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines instead of failing on split
            _, keys = line.split(': ', 1)

            yes_keys.update(keys.split(', '))

        # Fill data
        for key, man in general.items():  # iteritems() will be faster in Python 2.x
            man[field] = 'YES' if key in yes_keys else 'NO'


def save_csv(fname, general):
    with open(fname, 'wt') as f:
        # Gather field names
        all_fields = set()
        for value in general.values():
            all_fields.update(value.keys())

        # Write to csv; sort the names so the column order is stable
        w = csv.DictWriter(f, sorted(all_fields))
        w.writeheader()
        w.writerows(general.values())


def main():
    general = read_general('general.txt')
    add_bool_criteria('cars.txt', 'Car?', general)
    add_bool_criteria('house.txt', 'House?', general)
    from pprint import pprint
    pprint(general)
    save_csv('result.csv', general)


if __name__ == '__main__':
    main()

I wish you a lot of $$$ for this ;)
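
Since the question's end goal is a searchable database and some fields contain commas, here is a minimal sketch of that last step. It is not part of the answer's code: the rows, the table name `people`, and the in-memory buffer and database are all assumptions made for illustration. It shows that the csv module quotes a value like "Mac, PC" on its own, and that the rows can then be loaded into SQLite (via the standard sqlite3 module) and searched by department abbreviation.

```python
import csv
import io
import sqlite3

# Hypothetical rows shaped like the records built from general.txt
rows = [
    {'Name': 'Scott Lee', 'Division_Abbreviation': 'PG', 'Id': '403',
     'Equipment': 'Personal Computer', 'Car?': 'YES'},
    {'Name': 'Mary Smith', 'Division_Abbreviation': 'PR', 'Id': '605',
     'Equipment': 'Mac, PC', 'Car?': 'NO'},
]
fields = ['Name', 'Division_Abbreviation', 'Id', 'Equipment', 'Car?']

# Write CSV to an in-memory buffer; the csv module quotes 'Mac, PC' itself
buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)

# Read the CSV back as a list of dicts
buf.seek(0)
records = list(csv.DictReader(buf))

# Load the records into an in-memory SQLite table (assumed name: people)
conn = sqlite3.connect(':memory:')
columns = ', '.join('"%s" TEXT' % name for name in fields)
conn.execute('CREATE TABLE people (%s)' % columns)
placeholders = ', '.join('?' for _ in fields)
conn.executemany(
    'INSERT INTO people VALUES (%s)' % placeholders,
    [tuple(rec[name] for name in fields) for rec in records],
)

# Search by department abbreviation
pr_rows = conn.execute(
    'SELECT "Name", "Equipment" FROM people WHERE "Division_Abbreviation" = ?',
    ('PR',),
).fetchall()
print(pr_rows)
```

For real files you would open result.csv with newline='' (as the csv documentation recommends on Python 3) and connect to a file path instead of ':memory:'.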

CSV is history; you could use JSON for storage and further use, because it's simpler to use, more flexible, and human-readable.
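
As a rough sketch of the JSON route, assuming a dict keyed by 'DEPT room' like the one the parsing code builds (the sample record here is made up for illustration), the standard json module can serialize it and load it back without any special handling for commas:

```python
import json

# Hypothetical record keyed by 'DEPT room', like the parsed dict above
general = {
    'PR 605': {'Name': 'Mary Smith', 'Division': 'Public Relations',
               'Equipment': 'Mac, PC', 'Age': '24', 'Car?': 'NO'},
}

# Serialize; indent and sort_keys keep the text readable and stable
text = json.dumps(general, indent=2, sort_keys=True)

# Load it back to confirm nothing is lost, commas included
restored = json.loads(text)
print(restored['PR 605']['Equipment'])
```

To store it on disk, json.dump(general, f) with an open file object f does the same thing as json.dumps but writes to the file directly.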
