scikit-learn GridSearchCV with multiple repetitions


Problem description

I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the train/test split highly influences the overall performance (r2 in this instance). To address this problem, I'd like to implement repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?

Quick solution:

Following the idea presented in the official scikit-learn documentation, a quick solution is the following:

import numpy
from sklearn.model_selection import KFold, GridSearchCV

# svr, p_grid, X and y are your estimator, parameter grid and data
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ is only available after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))


Recommended answer

This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.

You can adapt the steps to suit your needs:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

# The official example uses SVC on the iris data; for your regression
# problem, swap in SVR and your own X/y
X_iris, y_iris = load_iris(return_X_y=True)
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100]}  # extend with the values you want to try

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()

Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()



  1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
  2. Pass clf, X, y, outer_cv to cross_val_score.
  3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
  4. X_outer_test will be held back, and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
  5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
  6. Now the grid search estimator will be trained using X_inner_train and y_inner_train, and scored using X_inner_test and y_inner_test.
  7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
  8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are used to refit clf.best_estimator_ on all of the outer training data, i.e. X_outer_train.
  9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
  10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
  11. We then use mean() to get back nested_score; a simplified sketch of this whole outer loop follows below.
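
To make these steps concrete, here is a simplified sketch (reusing clf, outer_cv, X_iris and y_iris from the snippet above) of what cross_val_score effectively does in the outer loop; the real implementation also handles custom scorers, parallelism, etc.:

import numpy as np
from sklearn.base import clone

outer_scores = []
# Step 3: outer_cv splits the data into outer train/test folds
for train_idx, test_idx in outer_cv.split(X_iris):
    X_outer_train, X_outer_test = X_iris[train_idx], X_iris[test_idx]
    y_outer_train, y_outer_test = y_iris[train_idx], y_iris[test_idx]

    # Steps 4-8: the inner 5-fold grid search picks the best C on
    # X_outer_train and refits best_estimator_ on all of X_outer_train
    fitted = clone(clf).fit(X_outer_train, y_outer_train)

    # Step 9: score the refitted best estimator on the held-out outer fold
    outer_scores.append(fitted.score(X_outer_test, y_outer_test))

# Steps 10-11: average the outer scores
nested_score = np.mean(outer_scores)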

