scikit-learn GridSearchCV with multiple repetitions
Question
I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the Training/Test split strongly influences the overall performance (r2 in this instance). To address this, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it with GridSearchCV?
Quick solution:
Following the idea presented in the official scikit-learn documentation, a quick solution is:
from sklearn.model_selection import GridSearchCV, KFold
import numpy

# svr, p_grid, X and y are assumed from the question's context
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    # Re-shuffle the 5 folds on every trial with a different seed
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ is only available after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
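There is also a directly built-in answer to the question: newer versions of scikit-learn (0.19+) ship RepeatedKFold, which runs k-fold CV n_repeats times with different randomization, so the whole 10 x 5CV can be handed to GridSearchCV in one call. A minimal sketch, assuming the diabetes dataset stands in for the asker's data and a small C grid:

```python
from sklearn.datasets import load_diabetes  # assumption: any regression data works
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# 10 repetitions of 5-fold CV = 50 train/test splits per candidate C
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

clf = GridSearchCV(
    estimator=SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100]},  # hypothetical grid for illustration
    scoring="r2",                    # the question evaluates r2
    cv=cv,
)
clf.fit(X, y)

print("Best C:", clf.best_params_["C"])
print("Mean r2 over 10 x 5 folds:", clf.best_score_)
```

With this approach, best_score_ is already the mean over all 50 folds, so the manual NUM_TRIALS loop above is no longer needed.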
Answer
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your needs:
svr = SVC(kernel="rbf")  # the official example uses SVC on iris; use SVR for regression
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# To be used within GridSearch (5 in your case);
# i is the trial index if you wrap this in the NUM_TRIALS loop above,
# otherwise use any fixed seed
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CVs:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
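The snippet above follows the official iris/SVC example; since the question is about SVR scored with r2, here is a sketch of the same nesting adapted to that setting (assumption: the diabetes dataset and a small C grid stand in for the asker's data and grid):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

clf = GridSearchCV(
    estimator=SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100]},  # hypothetical grid for illustration
    scoring="r2",
    cv=inner_cv,
)

# One r2 score per outer fold: 10 values, each backed by a 5-fold inner grid search
nested_scores = cross_val_score(clf, X=X, y=y, cv=outer_cv, scoring="r2")
print("Mean nested r2:", nested_scores.mean())
```

Passing scoring="r2" to both the inner and the outer loop keeps model selection and model evaluation on the same metric.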
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back, and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the grid-search estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters with the best average score over all inner iterations (X_inner_train, X_inner_test) are passed on to clf.best_estimator_ and fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
11. We then use mean() to get nested_score.
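The steps above can also be written out by hand, which makes the division of labour explicit: cross_val_score does the outer loop, GridSearchCV the inner one. A minimal sketch, assuming iris and SVC as in the answer's snippet:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100]}  # hypothetical grid for illustration

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):           # steps 3-4
    X_outer_train, X_outer_test = X[train_idx], X[test_idx]
    y_outer_train, y_outer_test = y[train_idx], y[test_idx]

    # Steps 5-8: the inner CV picks C; best_estimator_ is refit on X_outer_train
    gs = GridSearchCV(SVC(kernel="rbf"), p_grid, cv=inner_cv)
    gs.fit(X_outer_train, y_outer_train)

    # Step 9: score the refit best estimator on the held-back outer fold
    outer_scores.append(gs.score(X_outer_test, y_outer_test))

nested_score = np.mean(outer_scores)                    # steps 10-11
print("nested score:", nested_score)
```

Note how each outer fold gets its own grid search, so the hyper-parameter choice may differ from fold to fold; nested_score estimates the whole tuning-plus-fitting procedure, not one fixed C.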