分组时间序列(面板)数据的交叉验证 [英] Cross-validation for grouped time-series (panel) data

查看：48 发布时间：2021/7/16 20:04:50 python-3.x scikit-learn time-series cross-validation panel-data

本文介绍了分组时间序列(面板)数据的交叉验证的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我使用面板数据:随着时间的推移，我观察了许多单位(例如人)；对于每个单元，我都有相同固定时间间隔的记录.

I work with panel data: I observe a number of units (e.g. people) over time; for each unit, I have records for the same fixed time intervals.

在将数据拆分为训练集和测试集时，我们需要确保两个集是不相交的并且顺序，即训练集中的最新记录应该在测试中最早的记录之前设置(例如参见此博文).

When splitting the data into train and test sets, we need to make sure that both sets are disjoint and sequential, i.e. the latest records in the train set should be before the earliest records in the test set (see e.g. this blog post).

是否有针对面板数据进行交叉验证的标准 Python 实现?

我试过 Scikit-Learn 的 TimeSeriesSplit，不能考虑组，而 GroupShuffleSplit 不能考虑到数据的顺序性，请参见下面的代码.

I've tried Scikit-Learn's TimeSeriesSplit, which cannot account for groups, and GroupShuffleSplit which cannot account for the sequential nature of the data, see code below.

import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

# generate panel data
user = np.repeat(np.arange(10), 12)
time = np.tile(pd.date_range(start='2018-01-01', periods=12, freq='M'), 10)
data = (pd.DataFrame({'user': user, 'time': time})
        .sort_values(['time', 'user'])
        .reset_index(drop=True))

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(data):
    train = data.iloc[train_idx]
    test = data.iloc[test_idx]
    train_end = train.time.max().date()
    test_start = test.time.min().date()
    print('TRAIN:', train_end, '\tTEST:', test_start, '\tSequential:', train_end < test_start, sep=' ')

输出:

TRAIN: 2018-03-31   TEST: 2018-03-31    Sequential: False
TRAIN: 2018-05-31   TEST: 2018-05-31    Sequential: False
TRAIN: 2018-08-31   TEST: 2018-08-31    Sequential: False
TRAIN: 2018-10-31   TEST: 2018-10-31    Sequential: False

所以，在这个例子中，我希望训练集和测试集仍然是连续的.

So, in this example, I want the train and test sets to still be sequential.

有许多相关的旧帖子，但没有(令人信服的)答案，请参见例如

There are a number of related, older posts, but with no (convincing) answer, see e.g.

推荐答案

我最近遇到了同样的任务，在找不到合适的解决方案后，我决定编写自己的类，它是 scikit-learn TimeSeriesSplit 实现.因此，我将离开这里，供以后寻找解决方案的人使用.

I have recently hit the same task and after I failed to find appropriate solution I decided to write my own class which is a tweaked version of scikit-learn TimeSeriesSplit implementation. Therefore, I'll leave here for whoever comes later looking for the solution.

这个想法基本上是按time对data进行排序，根据time变量对观察进行分组，然后构建交叉验证器与 TimeSeriesSplit 相同，但在新形成的观察组上.

The idea is basically to sort the data by time, group the observations according to time variable and then just build cross-validator the same way TimeSeriesSplit does, but on the newly formed groups of observations.

import numpy as np
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
from sklearn.model_selection._split import _BaseKFold

class GroupTimeSeriesSplit(_BaseKFold):
    """
    Time Series cross-validator for a variable number of observations within the time 
    unit. In the kth split, it returns first k folds as train set and the (k+1)th fold 
    as test set. Indices can be grouped so that they enter the CV fold together.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    """
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """
        Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples and n_features is 
            the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into 
            train/test set.
            Most often just a time feature.

        Yields
        -------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        n_splits = self.n_splits
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_folds = n_splits + 1
        indices = np.arange(n_samples)
        group_counts = np.unique(groups, return_counts=True)[1]
        groups = np.split(indices, np.cumsum(group_counts)[:-1])
        n_groups = _num_samples(groups)
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds ={0} greater"
                 " than the number of groups: {1}.").format(n_folds, n_groups))
        test_size = (n_groups // n_folds)
        test_starts = range(test_size + n_groups % n_folds,
                            n_groups, test_size)
        for test_start in test_starts:
            if self.max_train_size:
                train_start = np.searchsorted(
                    np.cumsum(
                        group_counts[:test_start][::-1])[::-1] < self.max_train_size + 1, 
                        True)
                yield (np.concatenate(groups[train_start:test_start]),
                       np.concatenate(groups[test_start:test_start + test_size]))
            else:
                yield (np.concatenate(groups[:test_start]),
                       np.concatenate(groups[test_start:test_start + test_size]))

并将其应用于 OP 示例，我们得到:

And applying it to OP example we get:

gtscv = GroupTimeSeriesSplit(n_splits=3)
for split_id, (train_id, val_id) in enumerate(gtscv.split(data, groups=data["time"])):
    print("Split id: ", split_id, "\n") 
    print("Train id: ", train_id, "\n", "Validation id: ", val_id)
    print("Train dates: ", data.loc[train_id, "time"].unique(), "\n", "Validation dates: ", data.loc[val_id, "time"].unique(), "\n")

Split id:  0 

Train id:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29] 
 Validation id:  [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
 54 55 56 57 58 59]
Train dates:  ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
 '2018-03-31T00:00:00.000000000'] 
 Validation dates:  ['2018-04-30T00:00:00.000000000' '2018-05-31T00:00:00.000000000'
 '2018-06-30T00:00:00.000000000'] 

Split id:  1 

Train id:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59] 
 Validation id:  [60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89]
Train dates:  ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
 '2018-03-31T00:00:00.000000000' '2018-04-30T00:00:00.000000000'
 '2018-05-31T00:00:00.000000000' '2018-06-30T00:00:00.000000000'] 
 Validation dates:  ['2018-07-31T00:00:00.000000000' '2018-08-31T00:00:00.000000000'
 '2018-09-30T00:00:00.000000000'] 

Split id:  2 

Train id:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89] 
 Validation id:  [ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119]
Train dates:  ['2018-01-31T00:00:00.000000000' '2018-02-28T00:00:00.000000000'
 '2018-03-31T00:00:00.000000000' '2018-04-30T00:00:00.000000000'
 '2018-05-31T00:00:00.000000000' '2018-06-30T00:00:00.000000000'
 '2018-07-31T00:00:00.000000000' '2018-08-31T00:00:00.000000000'
 '2018-09-30T00:00:00.000000000'] 
 Validation dates:  ['2018-10-31T00:00:00.000000000' '2018-11-30T00:00:00.000000000'
 '2018-12-31T00:00:00.000000000']

这篇关于分组时间序列(面板)数据的交叉验证的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

分组时间序列(面板)数据的交叉验证 [英] Cross-validation for grouped time-series (panel) data

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

分组时间序列(面板)数据的交叉验证 [英] Cross-validation for grouped time-series (panel) data

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭