Walk Forward with validation window for time series data cross validation

Question

I'm looking to perform walk-forward validation on my time-series data. Extensive documentation exists on how to do this with a rolling window or with an expanding window.

But neither scheme matches what runs in my production system: I want to retrain a model daily, and each model predicts 14 days into the future. So each retraining adds only one day of data to the previous training period, whereas the other methods grow each successive training fold by a whole block of length test_size (14 days in my case). I would therefore like to validate my model with a sliding window that moves forward one day at a time.

My problem is that I can't find a Python library that does this. TimeSeriesSplit from sklearn has no option of that kind. Basically, I want to provide
test_size, n_fold and min_train_size, and

if n_fold > (n_samples - min_train_size) // test_size, then the next training set draws data from the previous fold's test set.
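
The scheme described in the question can be sketched in plain Python as follows. This is only a minimal illustration, not an existing library API: the function name walk_forward_splits and its step parameter are hypothetical, and the training set is assumed to expand from index 0.

```python
def walk_forward_splits(n_samples, min_train_size, test_size, step=1):
    """Yield (train, test) index lists; the test window moves `step` samples per fold."""
    if min_train_size < 1 or test_size < 1 or step < 1:
        raise ValueError("min_train_size, test_size and step must be positive")
    test_start = min_train_size
    # Stop as soon as a full test window no longer fits in the data.
    while test_start + test_size <= n_samples:
        train = list(range(0, test_start))  # expanding training set
        test = list(range(test_start, test_start + test_size))
        yield train, test
        test_start += step


# 30 daily samples, at least 10 days of training, 14-day test horizon.
folds = list(walk_forward_splits(n_samples=30, min_train_size=10, test_size=14))
for train, test in folds[:2]:
    print(len(train), test[0], test[-1])
```

With step=1 consecutive test windows overlap by test_size - 1 samples, which mirrors a daily retraining with a 14-day forecast horizon.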

Solution

Here is my solution, which lets the user specify the testing horizon and the minimum number of samples used for training:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.validation import _num_samples

class TimeSeriesSplitCustom(TimeSeriesSplit):
    def __init__(self, n_splits=5, max_train_size=None,
                 test_size=1,
                 min_train_size=1):
        super().__init__(n_splits=n_splits, max_train_size=max_train_size)
        self.test_size = test_size
        self.min_train_size = min_train_size

    def overlapping_split(self, X, y=None, groups=None):
        min_train_size = self.min_train_size
        test_size = self.test_size

        n_splits = self.n_splits
        n_samples = _num_samples(X)

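        # Enough data for n_splits non-overlapping test windows:
        # fall back to the standard TimeSeriesSplit behaviour.
        if (n_samples - min_train_size) / test_size >= n_splits:
            print('(n_samples -  min_train_size) / test_size >= n_splits')
            print('default TimeSeriesSplit.split() used')
            yield from super().split(X)

        else:
            # Otherwise the test windows must overlap: move the test start
            # by `shift` samples per fold so the last window ends at n_samples.
            shift = int(np.floor(
                (n_samples - test_size - min_train_size) / (n_splits - 1)))

            start_test = n_samples - (n_splits * shift + test_size - shift)

            test_starts = range(start_test, n_samples - test_size + 1, shift)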

            if start_test < min_train_size:
                raise ValueError(
                    ("The start of the testing : {0} is smaller"
                     " than the minimum training samples: {1}.").format(start_test,
                                                                        min_train_size))

            indices = np.arange(n_samples)

            for test_start in test_starts:
                if self.max_train_size and self.max_train_size < test_start:
                    yield (indices[test_start - self.max_train_size:test_start],
                           indices[test_start:test_start + test_size])
                else:
                    yield (indices[:test_start],
                           indices[test_start:test_start + test_size])
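
To check the arithmetic of the overlapping branch by hand, the shift and test-start computation can be reproduced without scikit-learn. This is a sketch only: the helper name overlapping_test_starts is hypothetical, and the guards for n_splits < 2 and for a zero shift are additions that are not part of the class above.

```python
import math


def overlapping_test_starts(n_samples, n_splits, test_size, min_train_size):
    """Return (shift, test_starts) using the same formulas as the overlapping branch."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    shift = math.floor((n_samples - test_size - min_train_size) / (n_splits - 1))
    if shift < 1:
        raise ValueError("not enough samples for this many overlapping splits")
    start_test = n_samples - (n_splits * shift + test_size - shift)
    return shift, list(range(start_test, n_samples - test_size + 1, shift))


# Same settings as the visualisation in this answer.
shift, starts = overlapping_test_starts(
    n_samples=100, n_splits=13, test_size=20, min_train_size=12)
print(shift, starts)
```

For these settings the test window advances 5 samples per fold, from index 20 to index 80, which gives 13 folds whose last test window ends exactly at the final sample.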

And with the visualisation:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# ModelEvaluation is my own module that contains the class defined above
from ModelEvaluation import TimeSeriesSplitCustom
np.random.seed(1338)
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
n_splits = 13

# Generate the class/group data
n_points = 100
X = np.random.randn(n_points, 10)

percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(100 * perc)
               for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups repeated once
groups = np.hstack([[ii] * 10 for ii in range(10)])

fig, ax = plt.subplots()

cv = TimeSeriesSplitCustom(n_splits=n_splits, test_size=20, min_train_size=12)
# plot_cv_indices comes from scikit-learn's "Visualizing cross-validation behavior" example
plot_cv_indices(cv, X, y, groups, ax, n_splits)
plt.show()

(To get the same result, make sure to change the loop in the plot_cv_indices function to
for ii, (tr, tt) in enumerate(cv.overlapping_split(X=X, y=y, groups=group)):
so that it calls overlapping_split instead of split.)
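
If matplotlib is not at hand, the folds can also be inspected as text. The sketch below is a hypothetical helper (render_folds is not part of scikit-learn or of the answer's code) that accepts any iterable of (train, test) index pairs, such as the output of overlapping_split; the small hand-written splits are only for illustration.

```python
def render_folds(splits, n_samples):
    """Return one string per fold: '-' train, '#' test, '.' unused."""
    rows = []
    for train, test in splits:
        row = ["."] * n_samples
        for i in train:
            row[i] = "-"
        for i in test:
            row[i] = "#"
        rows.append("".join(row))
    return rows


# Illustrative splits: test window of 3 samples moving 1 sample per fold.
example_splits = [
    (range(0, 4), range(4, 7)),
    (range(0, 5), range(5, 8)),
    (range(0, 6), range(6, 9)),
]
rows = render_folds(example_splits, n_samples=10)
for row in rows:
    print(row)
```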

Cheers!
