Pandas: decompress date range to individual dates
Question
Dataset: I have a 1 GB dataset of stocks whose values apply over date ranges. The date ranges do not overlap, and the dataset is sorted on (ticker, start_date).
>>> df.head()
        start_date    end_date            val
ticker
AAPL    2014-05-01  2014-05-01  10.0000000000
AAPL    2014-06-05  2014-06-10  20.0000000000
GOOG    2014-06-01  2014-06-15  50.0000000000
MSFT    2014-06-16  2014-06-16           None
TWTR    2014-01-17  2014-05-17  10.0000000000
Goal: I want to decompress the dataframe so that it has individual dates instead of date ranges. For example, the two AAPL rows would expand to 7 rows:
>>> AAPL_decompressed.head()
                      val
date
2014-05-01  10.0000000000
2014-06-05  20.0000000000
2014-06-06  20.0000000000
2014-06-07  20.0000000000
2014-06-08  20.0000000000
I'm hoping pandas has a nice optimized method, like resample, that can do this in a couple of lines.
Recommended answer
It takes a bit more than a few lines, but I think it produces what you asked for.
Starting from this dataframe (shown with the row counter column that is added in the next step):
In [70]: df
Out[70]:
        start_date    end_date  val  row
ticker
AAPL    2014-05-01  2014-05-01   10    0
AAPL    2014-06-05  2014-06-10   20    1
GOOG    2014-06-01  2014-06-15   50    2
MSFT    2014-06-16  2014-06-16  NaN    3
TWTR    2014-01-17  2014-05-17   10    4
First I reshape this into a dataframe with a single date column, so every row appears twice: once with its start_date and once with its end_date. I also add a counter column called row:
In [60]: df['row'] = range(len(df))
In [61]: starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
In [62]: ends = df[['end_date', 'val', 'row']].rename(columns={'end_date':'date'})
In [63]: df_decomp = pd.concat([starts, ends])
In [64]: df_decomp = df_decomp.set_index('row', append=True)
In [65]: df_decomp.sort_index()
Out[65]:
                  date  val
ticker row
AAPL   0    2014-05-01   10
       0    2014-05-01   10
       1    2014-06-05   20
       1    2014-06-10   20
GOOG   2    2014-06-01   50
       2    2014-06-15   50
MSFT   3    2014-06-16  NaN
       3    2014-06-16  NaN
TWTR   4    2014-01-17   10
       4    2014-05-17   10
Based on this new dataframe, I can group it by ticker and row, apply a daily resample to each group (taking the mean, which collapses the duplicated date of single-day ranges), and forward fill the days in between:
In [66]: df_decomp = df_decomp.groupby(level=[0,1]).apply(lambda x: x.set_index('date').resample('D').mean().ffill())
In [67]: df_decomp = df_decomp.reset_index(level=1, drop=True)
The last command drops the now superfluous row index level.
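For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the values shown in the question, and the names stacked, fill_daily and daily are illustrative choices rather than anything from the original answer. It assumes a reasonably recent pandas, where resample needs an explicit aggregation such as mean() before forward filling.

```python
import pandas as pd

# Sample data rebuilt from the frame shown in the question.
df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2014-05-01", "2014-06-05", "2014-06-01", "2014-06-16", "2014-01-17"]
        ),
        "end_date": pd.to_datetime(
            ["2014-05-01", "2014-06-10", "2014-06-15", "2014-06-16", "2014-05-17"]
        ),
        "val": [10.0, 20.0, 50.0, None, 10.0],
    },
    index=pd.Index(["AAPL", "AAPL", "GOOG", "MSFT", "TWTR"], name="ticker"),
)

# Counter that keeps each original range distinct after stacking.
df["row"] = range(len(df))

# One frame for the range starts, one for the range ends, stacked together.
starts = df[["start_date", "val", "row"]].rename(columns={"start_date": "date"})
ends = df[["end_date", "val", "row"]].rename(columns={"end_date": "date"})
stacked = pd.concat([starts, ends]).set_index("row", append=True)


def fill_daily(group):
    # mean() collapses the duplicated date of single-day ranges;
    # ffill() then carries the value across the days in between.
    return group.set_index("date").resample("D").mean().ffill()


daily = stacked.groupby(level=[0, 1]).apply(fill_daily)
daily = daily.reset_index(level=1, drop=True)  # drop the helper "row" level

print(daily.loc["AAPL"])
```

Because every (ticker, row) group is resampled separately, the gap between two ranges of the same ticker (for example AAPL between 2014-05-01 and 2014-06-05) is not filled, which matches the requested output.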
When we access the AAPL rows, it gives your desired output:
In [69]: df_decomp.loc['AAPL']
Out[69]:
            val
date
2014-05-01   10
2014-06-05   20
2014-06-06   20
2014-06-07   20
2014-06-08   20
2014-06-09   20
2014-06-10   20
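Since the dataset is around 1 GB, a per-group apply may be slow. As a possible alternative (this is not the method from the answer above, just a sketch under the same assumptions about the data), each row can be repeated once per day it covers and the dates rebuilt from day offsets with NumPy. The names n_days, row_pos, offsets and daily_alt are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the frame shown in the question.
df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2014-05-01", "2014-06-05", "2014-06-01", "2014-06-16", "2014-01-17"]
        ),
        "end_date": pd.to_datetime(
            ["2014-05-01", "2014-06-10", "2014-06-15", "2014-06-16", "2014-05-17"]
        ),
        "val": [10.0, 20.0, 50.0, None, 10.0],
    },
    index=pd.Index(["AAPL", "AAPL", "GOOG", "MSFT", "TWTR"], name="ticker"),
)

# Calendar days covered by each range, counting both endpoints.
n_days = ((df["end_date"] - df["start_date"]).dt.days + 1).to_numpy()

# Positional repeat: ticker labels are not unique, so avoid label-based lookup.
row_pos = np.repeat(np.arange(len(df)), n_days)
out = df.iloc[row_pos]

# Day offset within each range: 0, 1, ..., n_days - 1.
block_start = np.repeat(n_days.cumsum() - n_days, n_days)
offsets = np.arange(n_days.sum()) - block_start

# Start date of the range plus the offset gives each individual date.
dates = out["start_date"].to_numpy() + pd.to_timedelta(offsets, unit="D").to_numpy()

daily_alt = pd.DataFrame(
    {"val": out["val"].to_numpy()},
    index=pd.MultiIndex.from_arrays([out.index, dates], names=["ticker", "date"]),
)

print(daily_alt.loc["AAPL"])
```

This avoids the Python-level call per group, at the cost of materialising the full expanded frame in one step, so memory use should be checked on the real data.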