Python pandas: read CSV with multiple tables and a repeated preamble
Question
Is there a Pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain junk, and then get the header/value rows into data frames?
I'm relatively new to Python and have been using it to read multiple CSVs exported from a scientific instrument's datalog. For other CSV tasks so far I've always defaulted to the pandas library. However, these CSV exports can vary depending on the number of "tests" logged on each instrument.
The column headers and data structure are the same between instruments, but each test is separated by a "preamble" that can change. So I end up with backups that look something like this (this example has two tests, but there could be any number):
blah blah here's a test and
here's some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here's some more garbage
that's different than the last one
this should make
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
If the preamble were a fixed length each time I'd just use the skiprows parameter, but the preamble is variable length and so is the number of rows in each test.
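For reference, the fixed-length case mentioned above might look like the sketch below. The two-line preamble and the in-memory text are made up for illustration; only skiprows and skipinitialspace are real read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical file contents: exactly two preamble lines, then one table.
text = (
    "some preamble\n"
    "more preamble\n"
    "header1, header2, header3\n"
    "1, 2, 3\n"
    "4, 5, 6\n"
)

# skiprows=2 drops the first two lines, so the third line becomes the header.
# skipinitialspace=True strips the space that follows each comma.
fixed = pd.read_csv(io.StringIO(text), skiprows=2, skipinitialspace=True)
print(fixed)
```

This only works because the preamble length is known in advance, which is exactly what the exports described here do not guarantee.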
My end goal is to be able to merge all the tests and end up with something like:
header1, header2, header3
1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12
13, 14, 15
Which I can then manipulate with pandas as usual.
I've tried the following to find the first row with my expected headers:
import csv
import pandas as pd

with open('my_file.csv', newline='') as input_file:
    # skipinitialspace=True drops the space after each comma in the sample data
    reader = csv.reader(input_file, delimiter=',', skipinitialspace=True)
    for row_num, row in enumerate(reader):
        # The csv module returns an empty list [] for a blank line,
        # so check len(row) > 0 so it doesn't error out
        # later when searching for a string
        if len(row) > 0:
            # There's probably a better way to find it, but I just convert
            # the list to a string then search for the expected header
            if "['header1', 'header2', 'header3']" in str(row):
                header_row = row_num

df = pd.read_csv('my_file.csv', skiprows=header_row, header=0,
                 skipinitialspace=True)
print(df)
This works if I only have one test because it finds the first row that has the headers, but of course the header_row variable gets updated each additional time it finds the header, so in the example above I end up with this output:
   header1  header2  header3
0        7        8        9
1       10       11       12
2       13       14       15
I'm getting lost figuring out how to append each instance of the header/dataset to a dataframe before continuing on to search for the next instance.
It's also probably not very efficient, when dealing with a large number of files, to open each one once with the csv module and then again with pandas.
Recommended answer
This program might help. It is essentially a wrapper around the csv.reader() object that greps the good data out.
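import csv
import pandas as pd


def ignore_comments(rows, start_fn, end_fn, keep_initial):
    """Yield only the rows that belong to a table.

    start_fn(row) is true for the header row that opens a table;
    end_fn(row) is true for the row that closes it.
    """
    state = 'keep' if keep_initial else 'start'
    for row in rows:
        if state == 'start' and start_fn(row):
            # First header: emit it once so it can become the column names
            state = 'keep'
            yield row
        elif state == 'keep':
            if end_fn(row):
                state = 'drop'
            else:
                yield row
        elif state == 'drop':
            # Later headers only switch back to 'keep'; they are not re-emitted
            if start_fn(row):
                state = 'keep'


if __name__ == "__main__":
    with open('x.in', newline='') as fp:
        rows = list(ignore_comments(
            csv.reader(fp, skipinitialspace=True),
            lambda x: x == ['header1', 'header2', 'header3'],
            lambda x: x == [],  # assumes a blank line ends each table
            False))
    # Recent pandas versions expect read_csv to receive a path or file-like
    # object, not an iterator of rows, so build the frame from the rows.
    # The fields arrive as strings; to_numeric assumes they are all numeric.
    df = pd.DataFrame(rows[1:], columns=rows[0]).apply(pd.to_numeric)
    print(df)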
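If one DataFrame per test is more convenient, a variation on the same idea might look like the sketch below. It is not part of the original answer. The helper name read_tables is hypothetical, and it assumes a table ends at the first row whose field count differs from the header's, so a preamble line that happens to contain exactly two commas would be mistaken for data. The file is read in a single pass, and pd.concat merges the pieces.

```python
import csv
import io

import pandas as pd

HEADER = ["header1", "header2", "header3"]  # expected header row


def read_tables(fp, header=HEADER):
    """Return one DataFrame per table found in the open text file fp.

    Assumption: a table starts at a row equal to `header` and ends at the
    first row with a different number of fields, or at end of file.
    """
    tables = []
    rows = None  # None while inside a preamble, a list while inside a table
    for row in csv.reader(fp, skipinitialspace=True):
        if row == header:
            if rows:
                tables.append(rows)
            rows = []
        elif rows is not None:
            if len(row) == len(header):
                rows.append(row)
            else:
                if rows:
                    tables.append(rows)
                rows = None
    if rows:
        tables.append(rows)
    # Fields are strings; to_numeric raises if a value is not numeric.
    return [pd.DataFrame(t, columns=header).apply(pd.to_numeric)
            for t in tables]


# Made-up sample in the shape described in the question.
sample = """blah blah here's a test and
here's some information
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here's some more garbage
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
"""

tables = read_tables(io.StringIO(sample))
merged = pd.concat(tables, ignore_index=True)
print(merged)
```

Keeping the tables separate until the final concat leaves room to tag each one (for example with a test number) before merging.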