Explicitly specifying test/train sets in GridSearchCV


Question

I have a question about the cv parameter of sklearn's GridSearchCV.

I'm working with data that has a time component to it, so I don't think random shuffling within KFold cross-validation is sensible.

Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To better illustrate the question, here's how I would do that manually.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
np.random.seed(444)

index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')

# Train on the first 35 samples, validate on the next 15, test on
#     the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])

param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf

# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge

print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22

The process here is:

  • For both X and y, I want a training set, validation set, and testing set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
  • The train and validation sets are used to determine the optimal alpha parameter within Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
  • The test set is held out for the "actual" testing as unseen data.

Anyway, it seems like I'm looking to do something like this, but I'm not sure what to pass to cv here:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)

The docs, which I'm having trouble interpreting, specify:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • An object to be used as a cross-validation generator.
  • An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
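As a rough illustration of the last option in the quoted list, the sketch below passes cv a list holding a single (train, validation) pair of positional index arrays. The names X_dev, y_dev, train_idx, val_idx and custom_cv are made up for this example, and plain NumPy arrays are used instead of the pandas objects from the question to keep it short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the 60 time-ordered samples in the question.
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)

# Keep the last 10 samples aside as the untouched test set (hypothetical names).
X_dev, y_dev = X[:50], y[:50]

# One (train, validation) pair of positional indices into X_dev / y_dev.
train_idx = np.arange(0, 35)
val_idx = np.arange(35, 50)
custom_cv = [(train_idx, val_idx)]

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(), param_grid, cv=custom_cv)
grid_search.fit(X_dev, y_dev)

print('Number of splits:', grid_search.n_splits_)
print('Best alpha:', grid_search.best_params_['alpha'])
print('Validation score:', grid_search.best_score_)
```

With a single split, best_score_ is simply the score on the 15 validation rows, which is the quantity the manual loop compares.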

Recommended answer

As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the splitting as you have set it up in the question, you can use PredefinedSplit, which does exactly this.

So you need to make the following changes to your code.

# Here X_test, y_test is the untouched data
# Validation data (X_val, y_val) is currently inside X_train, which will be split using PredefinedSplit inside GridSearchCV
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])


# The indices which have the value -1 will be kept in train.
train_indices = np.full((35,), -1, dtype=int)

# The indices which have zero or positive values, will be kept in test
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)
# OUTPUT (array contents):
# array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)

# OUTPUT (abbreviated):
# TRAIN: indices 0 through 34
# TEST: indices 35 through 49


# And now, send this `ps` to cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)

The X_train, y_train sent to fit() will be split into train and test (val in your case) using the split we defined, so the Ridge will be trained on the original data at indices [0:35] and tested on [35:50].
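One detail worth checking after the fit: with the default refit=True, GridSearchCV refits the best estimator on everything passed to fit(), which here is all 50 rows rather than only the first 35. The sketch below is a self-contained approximation of the setup above using plain NumPy arrays. The name manual_model is invented for this example, and it shows one way to reproduce the question's manual procedure if training on the first 35 rows only is what you want.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X_train, X_test = X[:50], X[50:]
y_train, y_test = y[:50], y[50:]

# -1 marks rows that always stay in training; 0 marks the single validation fold.
test_fold = np.concatenate([np.full(35, -1), np.zeros(15, dtype=int)])
ps = PredefinedSplit(test_fold)

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps, refit=True)
grid_search.fit(X_train, y_train)

best_alpha = grid_search.best_params_['alpha']
print('Best alpha:', best_alpha)
print('Validation score:', grid_search.best_score_)

# best_estimator_ was refit on all 50 rows (train + validation).
print('Test score, refit on 50 rows:', grid_search.score(X_test, y_test))

# To mirror the manual loop, train on the first 35 rows only (hypothetical name).
manual_model = Ridge(random_state=444, alpha=best_alpha).fit(X_train[:35], y_train[:35])
print('Test score, trained on 35 rows:', manual_model.score(X_test, y_test))
```

The two test scores can differ, since the refit model has seen the 15 validation rows as well.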

Hope this clarifies how it works.
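If you would rather let GridSearchCV handle the splits while still respecting time order, one option to consider is TimeSeriesSplit from sklearn.model_selection, which always validates on rows that come after the training rows. This is a sketch of an alternative, not the approach described above. The choice of n_splits=4 and the names X_dev, y_dev and tscv are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)

# Hold out the last 10 rows as the untouched test set (hypothetical names).
X_dev, y_dev = X[:50], y[:50]

# Expanding-window splits: each validation block follows its training block.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X_dev):
    print('train: 0 to', train_idx[-1], '| validate:', val_idx[0], 'to', val_idx[-1])

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(), param_grid, cv=tscv)
grid_search.fit(X_dev, y_dev)
print('Best alpha:', grid_search.best_params_['alpha'])
```

Here best_score_ is averaged over several validation windows instead of one fixed block.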
