Use GroupKFold in nested cross-validation using sklearn
Question
My code is based on the example on the sklearn website: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
I am trying to use GroupKFold for both the inner and the outer CV.
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, GroupKFold
import numpy as np
# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}
# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")
# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)
# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris)
I know that passing the y values as the groups argument is not what it is meant for! With this code I get the following error:
.../anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: The 'groups' parameter should not be None.
Does anyone have an idea how to solve this?
Thanks for your help,
Sören
Recommended answer
I came across a similar problem and found @Samalama's solution to be a good one. The only thing I needed to change was the fit call: I had to slice groups too, so that it has the same shape as X and y for the training set. Otherwise I get an error saying that the shapes of the three objects are not the same. Is that a correct implementation?
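from sklearn.model_selection import RandomizedSearchCV

# x, y and groups are NumPy arrays of equal length; model, parameters_grid,
# get_scoring(), jobs, inner_cv and outer_cv are defined elsewhere.
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # Slice groups with the same indices so it lines up with x_train and y_train
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)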
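For reference, here is a minimal self-contained sketch of the same idea applied to the iris setup from the question. The group labels are made up purely for illustration (iris has no natural groups), and the names `groups`, `outer_scores` and `search` are my own, not from the original posts. It only shows the pattern of slicing `groups` with the outer training indices and passing the slice to `fit`; treat it as a sketch under those assumptions rather than a definitive implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# ASSUMPTION: synthetic group labels (10 random groups), for illustration only.
rng = np.random.default_rng(0)
groups = rng.integers(0, 10, size=len(y))

p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=groups):
    # No group may appear on both sides of the outer split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    search = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    # The inner GroupKFold needs the groups of the outer training samples only,
    # so slice groups with the same indices as X and y.
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

    # Score the refitted best estimator on the held-out outer fold.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("Nested CV accuracy per outer fold:", outer_scores)
print("Mean:", np.mean(outer_scores))
```

Writing the outer loop by hand keeps it explicit which slice of `groups` reaches the inner search, which is the piece missing when `fit` is called without `groups` and the inner splitter raises "The 'groups' parameter should not be None."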