How to properly pivot or reshape a timeseries dataframe in Pandas?
Question
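I need to reshape a dataframe that looks like df1 into df2. There are two considerations for this procedure:
- I need to be able to set the number of rows per slice as a parameter (length).
- I need to split the date and the time out of the index, then use the dates as column names in the reshaped frame and keep the times as the index.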
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired output df2, with the parameter 'length=5'
          2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
What I've done:
My approach was to create a multi-index (Date, Time) and then use a pivot table or some other reshape to get the desired df output.
import numpy as np
import pandas as pd

# First separate the date and the time
df['TimeStamp'] = df.index
df['Date'] = df.index.date
df['Hour'] = df.index.time

# Then number the rows and tag the first row of each slice with its date,
# so that those dates are available for building a multi-index.
# Only every 5th row gets an EventDate; the other rows are left as NaN.
df['Num'] = np.arange(len(df))
for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']

df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: A NaN appears next to each date in the first level of the multi-index. And even if that worked, I can't work out how to get what I need from a multi-index df.
I'm stuck. I appreciate any input.
Recommended answer
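import io
import numpy as np
import pandas as pd

# Fields are separated by two or more spaces, so the single space inside
# each timestamp is not treated as a separator.
data = '''\
val
2007-08-07 18:00:00  1
2007-08-08 00:00:00  2
2007-08-08 06:00:00  3
2007-08-08 12:00:00  4
2007-08-08 18:00:00  5
2007-11-02 18:00:00  6
2007-11-03 00:00:00  7
2007-11-03 06:00:00  8
2007-11-03 12:00:00  9
2007-11-03 18:00:00  10'''

# The header has one name fewer than the rows have fields, so the first
# field becomes the index; parse_dates=True parses it as datetimes.
# io.StringIO is needed on Python 3 (io.BytesIO only accepts bytes).
df = pd.read_csv(io.StringIO(data), sep=r'\s{2,}', engine='python', parse_dates=True)

chunksize = 5
chunks = len(df) // chunksize  # assumes len(df) is a multiple of chunksize

# Label every row with the date of the first row of its chunk
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
# Times of the first chunk, reused as the final index
index = df.index.time[:chunksize]
# Position of each row within its chunk: 0, 1, ..., chunksize - 1
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
# Move the chunk dates from the row index into the columns
df = df['val'].unstack('Date')
df.index = index
print(df)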
yields
Date      2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
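Since the question asks for the slice length to be a parameter, the same reshape can be wrapped in a function. What follows is only a minimal sketch, assuming the number of rows is an exact multiple of length. The name reshape_chunks is hypothetical, and it swaps the set_index/unstack route for a plain NumPy reshape.

```python
import pandas as pd


def reshape_chunks(series, length):
    """Hypothetical helper: one column per block of `length` consecutive rows.

    Assumes `series` has a DatetimeIndex and that len(series) is an exact
    multiple of `length`.
    """
    n_chunks, remainder = divmod(len(series), length)
    if remainder:
        raise ValueError("len(series) must be a multiple of length")
    # Each row of the reshaped array is one chunk; transpose so that each
    # chunk becomes a column instead.
    values = series.to_numpy().reshape(n_chunks, length).T
    return pd.DataFrame(
        values,
        index=series.index[:length].time,     # times of the first chunk
        columns=series.index[::length].date,  # date of each chunk's first row
    )


# Same data as df1 in the question
idx = pd.to_datetime([
    "2007-08-07 18:00:00", "2007-08-08 00:00:00", "2007-08-08 06:00:00",
    "2007-08-08 12:00:00", "2007-08-08 18:00:00", "2007-11-02 18:00:00",
    "2007-11-03 00:00:00", "2007-11-03 06:00:00", "2007-11-03 12:00:00",
    "2007-11-03 18:00:00",
])
s = pd.Series(range(1, 11), index=idx, name="val")

df2 = reshape_chunks(s, length=5)
print(df2)
```

As in the answer's code, the row labels come from the first chunk only, so this is only meaningful if every chunk follows the same time pattern.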
Note that the final DataFrame has an index with non-unique entries (18:00:00 is repeated). Some DataFrame operations are problematic when the index has repeated entries, so in general it is better to avoid this if possible.
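One possible way to avoid the repeated 18:00:00 label is to index the rows by the time elapsed since the start of their chunk instead of by clock time. This is a minimal sketch under that assumption, not part of the original answer. The names length, Offset and wide are illustrative, and it uses DataFrame.pivot rather than set_index/unstack.

```python
import numpy as np
import pandas as pd

# Same data as df1 in the question
idx = pd.to_datetime([
    "2007-08-07 18:00:00", "2007-08-08 00:00:00", "2007-08-08 06:00:00",
    "2007-08-08 12:00:00", "2007-08-08 18:00:00", "2007-11-02 18:00:00",
    "2007-11-03 00:00:00", "2007-11-03 06:00:00", "2007-11-03 12:00:00",
    "2007-11-03 18:00:00",
])
df = pd.DataFrame({"val": range(1, 11)}, index=idx)

length = 5  # rows per chunk (illustrative name)
pos = np.arange(len(df))

# Timestamp of the first row of the chunk that each row belongs to
starts = df.index[(pos // length) * length]

df["Date"] = starts.date         # chunk label, becomes the column names
df["Offset"] = df.index - starts  # elapsed time within the chunk

wide = df.pivot(index="Offset", columns="Date", values="val")
print(wide)
print(wide.index.is_unique)
```

Here the last row is labelled with an offset of one day rather than a second 18:00:00, so the index stays unique.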