Use GroupKFold in nested cross-validation using sklearn

Question

My code is based on the example on the sklearn website: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

I am trying to use GroupKFold in both the inner and the outer CV.

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, GroupKFold
import numpy as np

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris)

I know that passing the y values as the groups argument is not what it is meant for! With this code I get the following error:

.../anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: The 'groups' parameter should not be None.

Does anyone have an idea on how to solve this?

Thanks for your help,

Sören

Recommended answer

I came across a similar problem and found @Samalama's solution to be a good one. The only thing I needed to change was the fit call: I had to slice groups as well, so that it has the same length as X and y for the training set. Otherwise I get an error saying that the shapes of the three objects are not the same. Is that a correct implementation?

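from sklearn.model_selection import RandomizedSearchCV

# x, y and groups are NumPy arrays of equal length; model, parameters_grid,
# get_scoring(), jobs, inner_cv and outer_cv are defined elsewhere.
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # Slice groups with the same indices so it lines up with x_train and y_train.
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)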
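
For reference, here is a minimal, self-contained sketch of the same idea applied to the iris setup from the question. As far as I can tell, the error in the question arises because cross_val_score hands groups to the outer splitter only, so the inner GroupKFold inside GridSearchCV never receives it; writing the outer loop by hand lets you pass the sliced groups to fit yourself. The group labels below are an assumption made purely for illustration (iris has no natural groups), and GridSearchCV stands in for the RandomizedSearchCV used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical group labels: 15 made-up groups of 10 samples each,
# shuffled so that every group contains a mix of classes.
rng = np.random.RandomState(0)
groups = rng.permutation(np.repeat(np.arange(15), 10))

p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

outer_scores = []
for train_index, test_index in outer_cv.split(X, y, groups=groups):
    # Inner loop: hyperparameter search on the outer training fold only.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    # groups must be sliced with the same indices as X and y.
    search.fit(X[train_index], y[train_index], groups=groups[train_index])
    # Outer loop: score the refitted best model on the held-out fold.
    outer_scores.append(search.score(X[test_index], y[test_index]))

nested_score = np.mean(outer_scores)
print(outer_scores, nested_score)
```

Each outer fold keeps whole groups together, so no group contributes samples to both the training and the test side of a split, and the same holds inside the inner search.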
