How to append multiple pandas DataFrames in a loop?
Problem description
I've been banging my head on this Python problem for a while and am stuck. I'm looping over several CSV files and want a single data frame that combines them so that one column from each CSV file supplies a column name, with date_time as the common index.
There are 11 CSV files that look like this data frame, differing only in their value and pod number; the time_stamp is the same for all the CSVs.
data
   pod  time_stamp      value
0   97  2016-02-22   3.048000
1   97  2016-02-29  23.622001
2   97  2016-03-07  13.970001
3   97  2016-03-14   6.604000
4   97  2016-03-21        NaN
And this is the for-loop that I have so far:
import glob
import pandas as pd
filenames = sorted(glob.glob('*.csv'))
new = []
for f in filenames:
    data = pd.read_csv(f)
    time_stamp = [pd.to_datetime(d) for d in time_stamp]
    new.append(data)
my_df = pd.DataFrame(new, columns=['pod','time_stamp','value'])
What I want is a data frame that looks like this, where each column is the value column from one of the CSV files.
time_stamp        97       98       99  ...
2016-02-22   3.04800  4.20002   3.5500
2016-02-29  23.62201  24.7392  21.1110
2016-03-07  13.97001  11.0284  12.0000
But right now the output of my_df is very wrong and looks like this. Any ideas where I went wrong?
0
0 pod time_stamp value 0 22 2016-...
1 pod time_stamp value 0 72 2016-...
2 pod time_stamp value 0 79 2016-0...
3 pod time_stamp value 0 86 2016-...
4 pod time_stamp value 0 87 2016-...
5 pod time_stamp value 0 88 2016-...
6 pod time_stamp value 0 90 2016-0...
7 pod time_stamp value 0 93 2016-0...
8 pod time_stamp value 0 95 2016-...
I'd recommend first concatenating all your dataframes together with pd.concat, and then doing one final pivot operation.
filenames = sorted(glob.glob('*.csv'))
new = [pd.read_csv(f, parse_dates=['time_stamp']) for f in filenames]
df = pd.concat(new) # omit axis argument since it is 0 by default
df = df.pivot(index='time_stamp', columns='pod')
Note that I'm forcing read_csv to parse time_stamp when loading each dataframe, so parsing after loading is no longer required.
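To see what parse_dates changes, here is a minimal sketch that reads the same CSV text twice, once without and once with the argument. The in-memory csv_text is a hypothetical stand-in for one of the files on disk, built from the sample rows in the question; the names raw and parsed are just for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for one CSV file, using rows from the question.
csv_text = """pod,time_stamp,value
97,2016-02-22,3.048000
97,2016-02-29,23.622001
97,2016-03-07,13.970001
"""

# Without parse_dates the column is read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetimes while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["time_stamp"])

print(raw["time_stamp"].dtype)
print(parsed["time_stamp"].dtype)
```

With the column already parsed, the later pivot gets a proper datetime index without a separate pd.to_datetime pass.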
MCVE
df
   pod  time_stamp      value
0   97  2016-02-22   3.048000
1   97  2016-02-29  23.622001
2   97  2016-03-07  13.970001
3   97  2016-03-14   6.604000
4   97  2016-03-21        NaN
df.pivot(index='time_stamp', columns='pod')
                value
pod                97
time_stamp
2016-02-22   3.048000
2016-02-29  23.622001
2016-03-07  13.970001
2016-03-14   6.604000
2016-03-21        NaN
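The MCVE only has one pod, so here is a sketch of the same concat-then-pivot idea with two pods. The two frames are built in memory as stand-ins for two CSV files; the pod 98 numbers are taken from the desired-output table in the question. Passing values='value' is an addition on my part, not something the code above does: it should give plain pod numbers as column names rather than a two-level header with value on top.

```python
import pandas as pd

# Stand-ins for two CSV files (in practice these would come from read_csv).
dates = pd.to_datetime(["2016-02-22", "2016-02-29", "2016-03-07"])
pod_97 = pd.DataFrame(
    {"pod": 97, "time_stamp": dates, "value": [3.048000, 23.622001, 13.970001]}
)
pod_98 = pd.DataFrame(
    {"pod": 98, "time_stamp": dates, "value": [4.20002, 24.7392, 11.0284]}
)

# Stack the frames row-wise into one long frame.
long_df = pd.concat([pod_97, pod_98], ignore_index=True)

# Reshape: one row per time_stamp, one column per pod.
# values="value" (an assumption, see above) keeps the columns single-level.
wide = long_df.pivot(index="time_stamp", columns="pod", values="value")

print(wide)
```

Note that pivot raises a ValueError if the same (time_stamp, pod) pair appears more than once, so this assumes each file contributes each timestamp only once.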