Python pandas: read CSV with multiple tables and a repeated preamble


Question

Is there a Pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain junk, and then load the header/value rows into data frames?

I'm relatively new to Python and have been using it to read multiple CSVs exported from a scientific instrument's datalog. When dealing with CSVs for other tasks I've always defaulted to the pandas library. However, these CSV exports vary depending on the number of "tests" logged on each instrument.

The column headers and data structure are the same across instruments, but each test is preceded by a "preamble" whose content can change. So I end up with backups that look something like this (this example has two tests, but there could be any number):

blah blah here's a test and  
here's some information  
you don't care about  
even a little bit  
header1, header2, header3  
1, 2, 3  
4, 5, 6  

oh you have another test  
here's some more garbage  
that's different than the last one  
this should make  
life interesting  
header1, header2, header3  
7, 8, 9  
10, 11, 12  
13, 14, 15  

If the preamble were a fixed length each time, I'd just use the skiprows parameter, but both the preamble and the number of rows in each test vary in length.
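For reference, the fixed-length case mentioned here can be sketched as follows. This is a minimal illustration that assumes a single test with exactly four preamble lines; the in-memory `text` stands in for a real file and is not part of the original post.

```python
import io

import pandas as pd

# Hypothetical single-test export with a four-line preamble.
text = """\
blah blah here's a test and
here's some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
"""

# skiprows=4 drops the four preamble lines; skipinitialspace=True
# strips the space that follows each comma in the header and the data.
df = pd.read_csv(io.StringIO(text), skiprows=4, skipinitialspace=True)
print(df)
```

This only works because the preamble length is known in advance, which is exactly what the exports described here do not guarantee.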

My end goal is to merge all the tests and end up with something like:

header1, header2, header3  
1, 2, 3  
4, 5, 6  
7, 8, 9  
10, 11, 12  
13, 14, 15  

which I can then manipulate with pandas as usual.

I've tried the following to find the first row with my expected headers:

import csv
import pandas as pd

with open('my_file.csv', newline='') as input_file:
    reader = csv.reader(input_file, delimiter=',', skipinitialspace=True)
    for row_num, row in enumerate(reader):
        # The CSV module will return a blank list []
        # so added the len(row)>0 so it doesn't error out
        # later when searching for a string
        if len(row) > 0:
            # There's probably a better way to find it, but I just convert
            # the list to a string then search for the expected header
            if "['header1', 'header2', 'header3']" in str(row):
                header_row = row_num

df = pd.read_csv('my_file.csv', skiprows=header_row, header=0,
                 skipinitialspace=True)
print(df)

This works if I only have one test, because it finds the one row that has the headers. But the header_row variable is overwritten each time the header is found again, so for the example above I end up with only the last test:

   header1   header2   header3  
0        7         8           9
1       10        11          12
2       13        14          15

I'm getting lost figuring out how to append each instance of the header/dataset to a dataframe before continuing on to search for the next instance.

It's also probably not very efficient, when dealing with a large number of files, to open each one once with the csv module and then again with pandas.
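One possible way to collect every table in a single pass, so that the file is read only once, is sketched below. It assumes that each table starts at a row equal to the expected header and ends at a blank line or at the end of the input. The names `read_tables`, `HEADER`, and `SAMPLE` are hypothetical and not from the original post.

```python
import csv
import io

import pandas as pd

HEADER = ["header1", "header2", "header3"]  # expected header row

SAMPLE = """\
blah blah here's a test and
here's some information
header1, header2, header3
1, 2, 3
4, 5, 6

oh you have another test
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
"""


def read_tables(lines, header=HEADER):
    """Return one DataFrame per table found in an iterable of text lines."""
    tables = []
    current = None  # rows of the table being collected; None inside a preamble
    for row in csv.reader(lines, skipinitialspace=True):
        row = [cell.strip() for cell in row]
        if row == header:
            if current is not None:  # previous table had no trailing blank line
                tables.append(current)
            current = []
        elif current is not None:
            if any(row):
                current.append(row)
            else:  # blank line closes the current table
                tables.append(current)
                current = None
    if current is not None:  # input ended while still inside a table
        tables.append(current)
    # csv.reader yields strings, so convert each column to a numeric dtype.
    return [pd.DataFrame(t, columns=header).apply(pd.to_numeric) for t in tables]


frames = read_tables(io.StringIO(SAMPLE))
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With a real file, the same function could be called on an open file object instead of `io.StringIO(SAMPLE)`. Keeping the tables separate until `pd.concat` also leaves room to inspect or label each test before merging.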

Recommended answer

This program might help. It is essentially a wrapper around the csv.reader() object that filters out everything except the good data.

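import csv

import pandas as pd


def ignore_comments(rows, start_fn, end_fn, keep_initial):
    """Yield only the rows that belong to a table.

    The header row is yielded once, for the first table only.
    """
    state = 'keep' if keep_initial else 'start'
    for row in rows:
        if state == 'start' and start_fn(row):
            state = 'keep'
            yield row
        elif state == 'keep':
            if end_fn(row):
                state = 'drop'
            else:
                yield row
        elif state == 'drop':
            if start_fn(row):
                state = 'keep'


if __name__ == "__main__":
    with open('x.in', newline='') as fp:
        reader = csv.reader(fp, skipinitialspace=True)
        rows = list(ignore_comments(
            reader,
            lambda row: row == ['header1', 'header2', 'header3'],
            lambda row: row == [],
            False))

    # The rows are already parsed lists, so build the DataFrame directly.
    # csv.reader yields strings, so convert the columns to numbers.
    df = pd.DataFrame(rows[1:], columns=rows[0]).apply(pd.to_numeric)
    print(df)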
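A variation on the same idea is to filter the raw text lines rather than parsed rows, then hand the surviving lines to pd.read_csv so that pandas does the parsing and type inference. The sketch below assumes that the header line is known exactly and that each table ends at a blank line. The helper `keep_table_lines` and the in-memory `sample` are hypothetical names introduced for illustration.

```python
import io

import pandas as pd


def keep_table_lines(lines, header_line):
    """Yield the first header line and every data line, dropping preambles."""
    in_table = False
    seen_header = False
    for line in lines:
        stripped = line.strip()
        if stripped == header_line:
            in_table = True
            if not seen_header:  # emit the header only once
                seen_header = True
                yield stripped + "\n"
        elif not stripped:  # a blank line ends the current table
            in_table = False
        elif in_table:
            yield stripped + "\n"


sample = io.StringIO(
    "blah blah here's a test and\n"
    "even a little bit\n"
    "header1, header2, header3\n"
    "1, 2, 3\n"
    "4, 5, 6\n"
    "\n"
    "oh you have another test\n"
    "header1, header2, header3\n"
    "7, 8, 9\n"
    "10, 11, 12\n"
    "13, 14, 15\n"
)

# Join the filtered lines into one in-memory buffer that read_csv can parse.
filtered = io.StringIO("".join(keep_table_lines(sample, "header1, header2, header3")))
df = pd.read_csv(filtered, skipinitialspace=True)
print(df)
```

The trade-off is that the header must match as a string, so a change in spacing or column order in an export would need a looser comparison than the exact match used here.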
