How to properly pivot or reshape a timeseries dataframe in Pandas?


Question

I need to reshape a dataframe that looks like df1 and turn it into df2. There are two considerations for this procedure:

  • I need to set the number of rows per slice as a parameter (length).

  • I need to split the date and the time out of the index, use the dates as column names in the reshaped frame, and keep the times as the index.

Current df1

2007-08-07 18:00:00    1
2007-08-08 00:00:00    2
2007-08-08 06:00:00    3
2007-08-08 12:00:00    4
2007-08-08 18:00:00    5
2007-11-02 18:00:00    6
2007-11-03 00:00:00    7
2007-11-03 06:00:00    8
2007-11-03 12:00:00    9
2007-11-03 18:00:00   10

Desired output df2, with the parameter 'length=5'

          2007-08-07  2007-11-02
18:00:00      1           6
00:00:00      2           7
06:00:00      3           8
12:00:00      4           9
18:00:00      5          10

What I have done:

My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.

import numpy as np
import pandas as pd

# df is assumed to be df1 above: one value column on a DatetimeIndex.

# First separate time and date
df['TimeStamp'] = df.index
df['Date'] = df.index.date
df['Hour'] = df.index.time

# Then number the rows so the first row of each slice can be identified
# and its date used for the first level of the multi-index.
df['Num'] = np.arange(len(df))

# Only the first row of each 5-row slice gets an EventDate; the rest stay NaN.
for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']

df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']

Problem: A NaN appears next to each date in the first level of the multi-index. And even if that worked well, I can't find how to do what I need with a multi-index df.

I'm stuck. I appreciate any input.

Recommended answer

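import io

import numpy as np
import pandas as pd

data = '''\
                      val
2007-08-07 18:00:00    1
2007-08-08 00:00:00    2
2007-08-08 06:00:00    3
2007-08-08 12:00:00    4
2007-08-08 18:00:00    5
2007-11-02 18:00:00    6
2007-11-03 00:00:00    7
2007-11-03 06:00:00    8
2007-11-03 12:00:00    9
2007-11-03 18:00:00   10'''

# A regex separator requires the Python parsing engine; the unnamed first
# column becomes the index and is parsed as datetimes.
df = pd.read_table(io.StringIO(data), sep=r'\s{2,}', engine='python', parse_dates=True)

chunksize = 5
chunks = len(df) // chunksize  # assumes len(df) is a multiple of chunksize

# Label every row with the date of the first row in its chunk.
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
# Times of day of the first chunk, used as the final row labels.
index = df.index.time[:chunksize]
# Position of each row within its chunk: 0..chunksize-1, repeated.
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)

# Move the Date level into the columns, then restore the time-of-day labels.
df = df['val'].unstack('Date')
df.index = index
print(df)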

This yields:

Date      2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
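Since the question asks for the slice length to be a parameter, the same idea can be wrapped in a function. The following is only a sketch: the name reshape_by_length and its arguments are made up for illustration and are not part of the original answer. It also swaps the set_index/unstack steps for a plain NumPy reshape, and it assumes the number of rows is an exact multiple of length.

```python
import numpy as np
import pandas as pd


def reshape_by_length(series, length):
    """Hypothetical helper: one column per consecutive block of `length` rows.

    Columns are labelled with the date of each block's first timestamp;
    rows are labelled with the times of day taken from the first block.
    """
    if len(series) % length != 0:
        raise ValueError("len(series) must be a multiple of length")
    n_blocks = len(series) // length
    # Row-major reshape gives one row per block; transpose so blocks become columns.
    values = series.to_numpy().reshape(n_blocks, length).T
    columns = series.index[::length].date
    index = series.index[:length].time
    return pd.DataFrame(values, index=index, columns=columns)


# Rebuild df1 from the question: two runs of five 6-hourly timestamps.
first = pd.date_range("2007-08-07 18:00", periods=5, freq=pd.Timedelta(hours=6))
second = pd.date_range("2007-11-02 18:00", periods=5, freq=pd.Timedelta(hours=6))
s = pd.Series(np.arange(1, 11), index=first.append(second), name="val")

df2 = reshape_by_length(s, length=5)
print(df2)
```

Raising an error on a ragged final block is a design choice made here for simplicity; padding the last block with NaN would be an alternative.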






Note that the final DataFrame has an index with non-unique entries. (The 18:00:00 is repeated.) Some DataFrame operations are problematic when the index has repeated entries, so in general it is better to avoid this if possible.
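One possible way to avoid the repeated label, if the time-of-day index is not strictly required, is to index the rows by the time elapsed since the start of each block instead. This is an assumption about what would be acceptable, not something the question asked for; the names block, start, and wide are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild df1 from the question: two runs of five 6-hourly timestamps.
first = pd.date_range("2007-08-07 18:00", periods=5, freq=pd.Timedelta(hours=6))
second = pd.date_range("2007-11-02 18:00", periods=5, freq=pd.Timedelta(hours=6))
df = pd.DataFrame({"val": np.arange(1, 11)}, index=first.append(second))

chunksize = 5
# Block number of every row: 0 for the first chunksize rows, 1 for the next, ...
block = np.arange(len(df)) // chunksize
# First timestamp of the block each row belongs to.
start = df.index[::chunksize][block]

wide = pd.DataFrame({
    "Date": start.date,            # column label: date on which the block starts
    "Elapsed": df.index - start,   # row label: offset from the block start
    "val": df["val"].to_numpy(),
}).pivot(index="Elapsed", columns="Date", values="val")

print(wide)
print(wide.index.is_unique)
```

The first and last rows of each block now carry different offsets rather than the same 18:00:00 label, so the row index is unique.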
