LSTM/RNN preprocessing on a MultiIndex DataFrame with forecast data

Problem description

RNNs and LSTMs require sequences to be defined for each feature data point.

Forecast data (e.g. weather forecasts) are characterized by a calculation timestamp and a forecast timestamp (here dt_calc and dt_fore). Such data could yield a dataframe like this:

import pandas as pd

data = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [9, 8], [8, 9], [5, 4], [3, 3]],
                    index=pd.MultiIndex.from_tuples([
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 00:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 00:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
                    ],
                        names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])

For a sequence length of 2, a dataset suitable for an LSTM or RNN should look like this:

data = pd.DataFrame([[[2, 4], [3, 5]], [[4, 6], [5, 7]], [[6, 8], [7, 9]], [[8, 10], [9, 11]], [[12, 9], [13, 8]], [[9, 8], [8, 9]], [[8, 5], [9, 4]], [[5, 3], [4, 3]]],
                    index=pd.MultiIndex.from_tuples([
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
                    ],
                        names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])

And here it is for a sequence length of 3:

data = pd.DataFrame([[[2, 4, 6], [3, 5, 7]], [[4, 6, 8], [5, 7, 9]], [[6, 8, 10], [7, 9, 11]], [[12, 9, 8], [13, 8, 9]], [[9, 8, 5], [8, 9, 4]], [[8, 5, 3], [9, 4, 3]]],
                    index=pd.MultiIndex.from_tuples([
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 03:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0)
                    ],
                        names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])

This dataframe can easily be transformed into a numpy array of sequences.
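As a minimal sketch of that conversion, the list-valued cells can be stacked into a 3D array of shape (samples, timesteps, features). The helper name `to_lstm_array` and the small `nested` frame are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the restructured frame: each cell holds one sequence.
nested = pd.DataFrame(
    {"temp": [[2, 4], [4, 6], [6, 8]], "temp_2": [[3, 5], [5, 7], [7, 9]]}
)

def to_lstm_array(frame):
    """Stack list-valued cells into shape (samples, timesteps, features)."""
    # One (samples, timesteps) block per feature column.
    per_feature = [np.array(frame[col].tolist()) for col in frame.columns]
    # Put the feature axis last, the layout Keras-style recurrent layers take.
    return np.stack(per_feature, axis=-1)

X = to_lstm_array(nested)
print(X.shape)  # (3, 2, 2)
```

This assumes every cell in the frame holds a sequence of the same length; ragged cells would not stack into a regular array.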

The key point of this question is to respect the timestamps, because in this case a sequence is defined by a time period and not by index position.
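To illustrate the difference between a time period and an index position, the sketch below counts how many rows fall into a 2-hour time window ending at each forecast timestamp. It assumes an hourly forecast step, so a complete length-2 sequence corresponds to a window holding exactly two rows; the series `temp` and its missing 03:00 step are made up for illustration.

```python
import pandas as pd

# Hypothetical single forecast run in which the 03:00 step is missing.
fore = pd.to_datetime([
    "2019-07-04 00:00", "2019-07-04 01:00", "2019-07-04 02:00",
    "2019-07-04 04:00", "2019-07-04 05:00",
])
temp = pd.Series([12, 9, 8, 5, 3], index=fore)

# A time-based window covers (t - 2h, t], regardless of how many rows it holds.
counts = temp.rolling(pd.Timedelta(hours=2)).count()

# Only timestamps whose window holds two rows can end a length-2 sequence.
complete = counts == 2
print(complete)
```

A positional window of two rows would pair 02:00 with 04:00 even though they are two hours apart, which is exactly what the time-based count rules out.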

Following a good suggestion by Shubham Sharma, here is another example to clarify why the timestamps must be taken into account. With irregular intervals in dt_fore, the input looks like this:

data = pd.DataFrame([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [9, 8], [8, 9], [5, 4], [3, 3]],
                    index=pd.MultiIndex.from_tuples([
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 00:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 00:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 05:00:00'), 0)
                    ],
                        names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])

With n=2, this should be restructured for LSTM/RNN use as:

data = pd.DataFrame([[[2, 4], [3, 5]], [[4, 6], [5, 7]], [[6, 8], [7, 9]], [[8, 10], [9, 11]], [[12, 9], [13, 8]], [[9, 8], [8, 9]], [[5, 3], [4, 3]]],
                    index=pd.MultiIndex.from_tuples([
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 01:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 02:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 03:00:00'), 0),
                        (pd.Timestamp('2019-07-02 00:00:00'), pd.Timestamp('2019-07-02 04:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 01:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 02:00:00'), 0),
                        (pd.Timestamp('2019-07-04 00:00:00'), pd.Timestamp('2019-07-04 05:00:00'), 0)
                    ],
                        names=['dt_calc', 'dt_fore', 'positional_index']), columns=['temp', 'temp_2'])

Recommended answer

We can define a generator function that groups the dataframe by the dt_calc column and applies a rolling operation with a window of size n, aggregating each column into a list and thereby yielding sequences.

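def seq(n):
    # Flatten the MultiIndex so dt_calc becomes a regular column to group by.
    df = data.reset_index()
    # Iterating the grouped rolling object yields one window (a sub-frame) at a time.
    for g in df.groupby('dt_calc', sort=False).rolling(n):
        # A full window becomes an array with one row per column; an incomplete
        # window becomes an empty row, which dropna() removes below.
        yield g[data.columns].to_numpy().T if len(g) == n else []

pd.DataFrame(seq(2), index=data.index, columns=data.columns).dropna()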


# n=2
                                                    temp   temp_2
dt_calc    dt_fore             positional_index                  
2019-07-02 2019-07-02 01:00:00 0                  [2, 4]   [3, 5]
           2019-07-02 02:00:00 0                  [4, 6]   [5, 7]
           2019-07-02 03:00:00 0                  [6, 8]   [7, 9]
           2019-07-02 04:00:00 0                 [8, 10]  [9, 11]
2019-07-04 2019-07-04 01:00:00 0                 [12, 9]  [13, 8]
           2019-07-04 02:00:00 0                  [9, 8]   [8, 9]
           2019-07-04 03:00:00 0                  [8, 5]   [9, 4]
           2019-07-04 04:00:00 0                  [5, 3]   [4, 3]

# n=3
                                                       temp      temp_2
dt_calc    dt_fore             positional_index                        
2019-07-02 2019-07-02 02:00:00 0                  [2, 4, 6]   [3, 5, 7]
           2019-07-02 03:00:00 0                  [4, 6, 8]   [5, 7, 9]
           2019-07-02 04:00:00 0                 [6, 8, 10]  [7, 9, 11]
2019-07-04 2019-07-04 02:00:00 0                 [12, 9, 8]  [13, 8, 9]
           2019-07-04 03:00:00 0                  [9, 8, 5]   [8, 9, 4]
           2019-07-04 04:00:00 0                  [8, 5, 3]   [9, 4, 3]
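
Note that rolling(n) with an integer window is positional, so it does not check the spacing of dt_fore. For the irregular example in the question, a variant that also verifies the timestamps might look like the sketch below. The function name `time_aware_seq` and its `step` parameter (assumed here to be one hour) are hypothetical and not part of the original answer.

```python
import pandas as pd

def time_aware_seq(frame, n, step=pd.Timedelta(hours=1)):
    """Build length-n sequences, keeping only windows with evenly spaced dt_fore."""
    frame = frame.sort_index()  # orders rows by dt_calc, then dt_fore
    calc_ts = pd.Series(frame.index.get_level_values("dt_calc"))
    fore_ts = pd.Series(frame.index.get_level_values("dt_fore"))
    # follows[i] is True when row i belongs to the same forecast run as
    # row i - 1 and lies exactly one step after it.
    follows = (calc_ts.eq(calc_ts.shift()) & fore_ts.diff().eq(step)).to_numpy()
    values = frame.to_numpy()
    keep, rows = [], []
    for end in range(n - 1, len(frame)):
        start = end - n + 1
        # Every row after the first in the window must follow its predecessor.
        if follows[start + 1:end + 1].all():
            keep.append(end)
            # Transpose so each column contributes one length-n list.
            rows.append([col.tolist() for col in values[start:end + 1].T])
    return pd.DataFrame(rows, index=frame.index[keep], columns=frame.columns)

# Irregular input from the question: the 2019-07-04 run skips 03:00.
calc = pd.to_datetime(["2019-07-02"] * 5 + ["2019-07-04"] * 5)
fore = calc + pd.to_timedelta([0, 1, 2, 3, 4, 0, 1, 2, 4, 5], unit="h")
index = pd.MultiIndex.from_arrays(
    [calc, fore, [0] * 10], names=["dt_calc", "dt_fore", "positional_index"]
)
data = pd.DataFrame(
    {"temp": [2, 4, 6, 8, 10, 12, 9, 8, 5, 3],
     "temp_2": [3, 5, 7, 9, 11, 13, 8, 9, 4, 3]},
    index=index,
)

result = time_aware_seq(data, n=2)
print(result)
```

With n=2 this keeps seven rows and drops the window that would pair 02:00 with 04:00, matching the restructured frame the question asks for. The explicit loop favors readability over speed; a vectorized version would be preferable for large frames.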
