scikit-learn GridSearchCV with multiple repetitions


Question


I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the train/test split highly influences the overall performance (r2 in this instance). To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it with GridSearchCV?

Quick solution:


Following the idea presented in the scikit-learn official documentation, a quick solution is:

import numpy
from sklearn.model_selection import KFold, GridSearchCV

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # X, y: your data; fit is required before best_score_ is available
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))

Answer


This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.


You can adapt the steps to suit your needs:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Example data (iris, as in the scikit-learn docs); SVC is used here,
# but the same pattern applies to SVR
X_iris, y_iris = load_iris(return_X_y=True)

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100]}  # extend with more C values as needed

# CV technique: "KFold", "GroupKFold" (formerly "LabelKFold"), "LeaveOneOut", etc.

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
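Since the question is specifically about SVR evaluated with r2, here is a minimal sketch of the same nested pattern adapted to regression. It assumes regression arrays X and y are already defined; SVR and the "r2" scorer are standard scikit-learn:

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVR

svr = SVR(kernel="rbf")
c_grid = {"C": [1, 10, 100]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# r2 is used both for the inner parameter search and the outer evaluation
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv, scoring="r2")
nested_r2 = cross_val_score(clf, X=X, y=y, cv=outer_cv, scoring="r2").mean()  # X, y assumed defined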


Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()

  1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
  2. Pass clf, X, y, outer_cv to cross_val_score.
  3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
  4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
  5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
  6. Now the gridSearch estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
  7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
  8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are used to refit clf.best_estimator_ on all the data, i.e. X_outer_train.
  9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
  10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here) and an array of scores will be returned by cross_val_score.
  11. We then use mean() to get nested_score. A minimal sketch of this flow is shown after the list.
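To make steps 3 to 11 concrete, here is a minimal sketch of what cross_val_score effectively does with the GridSearchCV estimator. It assumes X_iris, y_iris, svr, c_grid, inner_cv and outer_cv from the code above:

import numpy as np
from sklearn.model_selection import GridSearchCV

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris):  # step 3: outer split
    X_outer_train, X_outer_test = X_iris[train_idx], X_iris[test_idx]
    y_outer_train, y_outer_test = y_iris[train_idx], y_iris[test_idx]

    gs = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
    gs.fit(X_outer_train, y_outer_train)  # steps 4-8: inner search, then refit with best params
    outer_scores.append(gs.score(X_outer_test, y_outer_test))  # step 9: score best_estimator_

nested_score = np.mean(outer_scores)  # steps 10-11: mean over the 10 outer folds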
