scikit-learn GridSearchCV with multiple repetitions
Question
I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the train/test split highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
Quick solution:
Following the idea presented in the scikit-learn official documentation, a quick solution is represented by:
import numpy
from sklearn.model_selection import GridSearchCV, KFold

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # fit is required before best_score_ is available; svr, p_grid, X, y assumed defined
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
Recommended answer
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your needs:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100]}  # extend with further C values as needed

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

i = 0  # seed / trial index (reused from the question's repetition loop)

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearch estimator to cross_val_score
# This will be your required 10 x 5 CVs:
# 10 for the outer CV and 5 for GridSearch's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
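If, on top of the nested scheme, you also want the 10 repetitions from the question (as in the official non-nested vs. nested comparison example), one possible way is to repeat the whole procedure over several random seeds. This is only a sketch, assuming svr, c_grid, X_iris and y_iris are defined as above; here the outer CV uses 5 folds, so each trial is one 5-fold CV and the loop gives the 10 x 5CV from the question:

import numpy

NUM_TRIALS = 10
nested_scores = []
for i in range(NUM_TRIALS):
    # a different shuffling of the data for each trial
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
    nested_scores.append(cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean())
print("Average nested score: {0} STD: {1}".format(numpy.mean(nested_scores), numpy.std(nested_scores)))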
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the gridSearch estimator will be trained using X_inner_train and y_inner_train, and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_ and fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score.
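To make these steps concrete, the following is a rough, hand-rolled sketch of what cross_val_score effectively does with the GridSearchCV estimator; the variable names mirror the steps above, and svr, c_grid, inner_cv, outer_cv, X_iris and y_iris are assumed to be defined as in the answer's code:

import numpy

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris):               # steps 3-4: outer split
    X_outer_train, X_outer_test = X_iris[train_idx], X_iris[test_idx]
    y_outer_train, y_outer_test = y_iris[train_idx], y_iris[test_idx]

    gs = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
    gs.fit(X_outer_train, y_outer_train)                         # steps 5-8: inner CV, then refit best params on X_outer_train

    outer_scores.append(gs.score(X_outer_test, y_outer_test))    # step 9: score on the held-back outer fold

nested_score = numpy.mean(outer_scores)                          # steps 10-11: collect and average the outer scores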