LSTM/RNN preprocessing on a MultiIndex DataFrame with forecast data
Question
RNNs and LSTMs require a sequence to be defined for each feature data point.
Forecast data (e.g. weather forecasts) are characterized by a calculation timestamp and a forecast timestamp (here dt_calc and dt_fore). Such data could yield a dataframe like this:
import pandas as pd

data = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [9, 8], [8, 9], [5, 4], [3, 3]],
index=pd.MultiIndex.from_tuples([
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 00:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 00:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
],
names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])
For a sequence length of 2, a dataset for use in an LSTM or RNN should look like this:
data = pd.DataFrame([[[2, 4], [3, 5]], [[4, 6], [5, 7]], [[6, 8], [7, 9]], [[8, 10], [9, 11]], [[12, 9], [13, 8]], [[9, 8], [8, 9]], [[8, 5], [9, 4]], [[5, 3], [4, 3]]],
index=pd.MultiIndex.from_tuples([
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
],
names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])
And here it is for a sequence length of 3:
data = pd.DataFrame([[[2, 4, 6], [3, 5, 7]], [[4, 6, 8], [5, 7, 9]], [[6, 8, 10], [7, 9, 11]], [[12, 9, 8], [13, 8, 9]], [[9, 8, 5], [8, 9, 4]], [[8, 5, 3], [9, 4, 3]]],
index=pd.MultiIndex.from_tuples([
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
],
names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])
This dataframe can then easily be transformed into a numpy array of sequences.
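As a rough sketch of that conversion (not part of the original question): assuming each cell holds a list of equal length, the nested lists can be stacked into a 3-D array and transposed. The name seq_df and the (samples, timesteps, features) target layout are assumptions made for illustration; the layout is the one recurrent layers in Keras-style libraries commonly expect.

```python
import numpy as np
import pandas as pd

# Small stand-in for the sequence dataframe shown above (n=2, first dt_calc only).
# seq_df is a hypothetical name used only in this sketch.
seq_df = pd.DataFrame(
    [[[2, 4], [3, 5]], [[4, 6], [5, 7]], [[6, 8], [7, 9]]],
    columns=["temp", "temp_2"],
)

# tolist() unpacks the object cells into nested lists, so the stacked
# array has the layout (samples, features, timesteps).
arr = np.array(seq_df.to_numpy().tolist())

# Assumed target layout: (samples, timesteps, features).
X = arr.transpose(0, 2, 1)
print(X.shape)
```

This only works when every cell has the same sequence length; ragged cells would need padding or filtering first.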
The important point in this question is to take the timestamps into account, because here a sequence is defined by a time period and not by index position.
Following a good suggestion by Shubham Sharma, I will outline another example to clarify why the timestamps matter. With irregular intervals in dt_fore, the input could look like this:
data = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [9, 8], [8, 9], [5, 4], [3, 3]],
index=pd.MultiIndex.from_tuples([
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 00:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 00:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 05:00:00'), 0)
],
names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])
For LSTM/RNN use with n=2, this should be restructured to:
data = pd.DataFrame([[[2, 4], [3, 5]], [[4, 6], [5, 7]], [[6, 8], [7, 9]], [[8, 10], [9, 11]], [[12, 9], [13, 8]], [[9, 8], [8, 9]],[[5, 3], [4, 3]]],
index=pd.MultiIndex.from_tuples([
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
(pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
(pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 05:00:00'), 0)
],
names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])
Recommended answer
We can define a generator function that groups the dataframe by the dt_calc column and applies a rolling operation with a window of size n, aggregating each column into a list and thereby yielding the sequences.
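def seq(n):
    # Turn the MultiIndex levels into ordinary columns so dt_calc can be grouped on
    df = data.reset_index()
    # Iterate over the rolling windows of size n within each dt_calc group
    for g in df.groupby('dt_calc', sort=False).rolling(n):
        # Full windows yield one sequence per column; shorter ones yield [] and are dropped below
        yield g[data.columns].to_numpy().T if len(g) == n else []

pd.DataFrame(seq(2), index=data.index, columns=data.columns).dropna()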
# n=2
                                                 temp     temp_2
dt_calc    dt_fore             positional_index
2019-07-02 2019-07-02 01:00:00 0                 [2, 4]   [3, 5]
           2019-07-02 02:00:00 0                 [4, 6]   [5, 7]
           2019-07-02 03:00:00 0                 [6, 8]   [7, 9]
           2019-07-02 04:00:00 0                 [8, 10]  [9, 11]
2019-07-04 2019-07-04 01:00:00 0                 [12, 9]  [13, 8]
           2019-07-04 02:00:00 0                 [9, 8]   [8, 9]
           2019-07-04 03:00:00 0                 [8, 5]   [9, 4]
           2019-07-04 04:00:00 0                 [5, 3]   [4, 3]
# n=3
                                                 temp        temp_2
dt_calc    dt_fore             positional_index
2019-07-02 2019-07-02 02:00:00 0                 [2, 4, 6]   [3, 5, 7]
           2019-07-02 03:00:00 0                 [4, 6, 8]   [5, 7, 9]
           2019-07-02 04:00:00 0                 [6, 8, 10]  [7, 9, 11]
2019-07-04 2019-07-04 02:00:00 0                 [12, 9, 8]  [13, 8, 9]
           2019-07-04 03:00:00 0                 [9, 8, 5]   [8, 9, 4]
           2019-07-04 04:00:00 0                 [8, 5, 3]   [9, 4, 3]
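The outputs above use the regularly spaced input. For the irregular dt_fore example in the question, a count-based window would also pair rows that are more than one step apart in time. The following is a minimal sketch of a timestamp-aware variant, not the answer's method: the helper time_sequences, its frame argument and the one-hour step are all assumptions introduced for illustration. It keeps a window only when every consecutive dt_fore gap equals the expected step.

```python
import pandas as pd

def time_sequences(frame, n, step=pd.Timedelta("1h")):
    """Hypothetical helper: per dt_calc group, build windows of n rows and
    keep only those whose dt_fore values are exactly `step` apart.
    The one-hour default is an assumption about the forecast resolution."""
    rows, index = [], []
    for _, g in frame.groupby(level="dt_calc", sort=False):
        fore = g.index.get_level_values("dt_fore")
        values = g.to_numpy()
        for end in range(n - 1, len(g)):
            window = fore[end - n + 1 : end + 1]
            gaps = window[1:] - window[:-1]
            # Skip windows that span a hole in the forecast timestamps
            if not (gaps == step).all():
                continue
            # One list of n values per column, labelled by the window's last row
            rows.append(values[end - n + 1 : end + 1].T.tolist())
            index.append(g.index[end])
    return pd.DataFrame(
        rows,
        index=pd.MultiIndex.from_tuples(index, names=frame.index.names),
        columns=frame.columns,
    )

# Rebuild the irregular example from the question: the second forecast run
# has no 03:00 entry and ends at 05:00 instead.
runs = [(pd.Timestamp("2019-07-02"), [0, 1, 2, 3, 4]),
        (pd.Timestamp("2019-07-04"), [0, 1, 2, 4, 5])]
index = pd.MultiIndex.from_tuples(
    [(calc, calc + pd.Timedelta(hours=h), 0) for calc, hours in runs for h in hours],
    names=["dt_calc", "dt_fore", "positional_index"],
)
data = pd.DataFrame(
    [[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [9, 8], [8, 9], [5, 4], [3, 3]],
    index=index, columns=["temp", "temp_2"],
)

result = time_sequences(data, 2)
print(result)
```

With n=2 this drops the window that would bridge 02:00 and 04:00 in the second run, which matches the restructured dataframe the question asks for. The explicit Python loop favors readability over speed; for large inputs a vectorized gap check per group would likely be preferable.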