Does GridSearchCV perform cross-validation?

Question

I'm currently working on a problem that compares the performance of three different machine learning algorithms on the same dataset. I split the dataset into 70/30 training/testing sets and then ran a grid search for the best parameters of each algorithm using GridSearchCV with X_train, y_train.

First question: am I supposed to perform the grid search on the training set, or on the whole dataset?

Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare with GridSearchCV?

Any answer would be appreciated, thank you.

Recommended answer

All estimators in scikit-learn whose names end with CV perform cross-validation. But you still need to keep a separate test set for measuring performance.
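One way to see the cross-validation happening is to inspect `cv_results_` after fitting. The following is a minimal sketch, not taken from the original answer: the iris dataset, the `SVC` estimator, the `C` grid, and `cv=5` are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Illustrative data only; the original question does not name a dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# cv=5 asks for 5-fold cross-validation on whatever data is passed to fit().
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# cv_results_ holds one score array per fold, with one entry per candidate.
fold_keys = sorted(
    key
    for key in search.cv_results_
    if key.startswith("split") and key.endswith("_test_score")
)
print(fold_keys)
print(search.cv_results_["mean_test_score"])
```

The `splitN_test_score` entries are scores on the held-out fold inside the training data, not on `X_test`, which is never seen by the search here.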

So you need to split your whole data into training and test sets. Forget about the test data for a while.

Then pass only the training data to the grid search. GridSearchCV will split this training data further into training and validation folds to tune the hyper-parameters passed to it. Finally, by default, it refits the model on the whole training data with the best parameters found.
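The steps described so far could look roughly like this. It is a sketch under assumptions: the iris dataset, the `SVC` estimator, and the parameter grid are placeholders, and only the 70/30 split comes from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder data; substitute your own X and y.
X, y = load_iris(return_X_y=True)

# 70/30 split as in the question. The test part is set aside untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical grid for illustration.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# refit=True is the default: after the search, the best parameter
# combination is refit on all of X_train.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated score of the best candidate

# The refit model, trained on the whole training set.
best_model = search.best_estimator_
```

Note that `best_score_` is a cross-validation score computed inside the training data, so it is not a substitute for a score on the held-out test set.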

Now test this model on the test data you set aside at the beginning. This gives you an estimate of how the model will perform on real-world data.
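For the three-algorithm comparison in the question, a sketch along the following lines may help. The three estimators, their grids, and the dataset are assumptions (the question does not name them). Passing one shared `StratifiedKFold` object is a design choice here, so that every algorithm is tuned on identical folds.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# One splitter with a fixed seed, reused for every algorithm.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidates and grids, for illustration only.
candidates = {
    "svc": (SVC(), {"C": [0.1, 1, 10]}),
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [2, 4, None]},
    ),
}

test_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv)
    search.fit(X_train, y_train)  # tuning sees only the training data
    # score() uses the refit best estimator; accuracy for classifiers.
    test_scores[name] = search.score(X_test, y_test)
    print(name, search.best_params_, test_scores[name])
```

Each algorithm is tuned by cross-validation on `X_train` alone, and the held-out test set is used once per algorithm for the final comparison.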

If you pass the whole dataset to GridSearchCV, test data leaks into parameter tuning, and the final model may not perform as well on new, unseen data as its scores suggest.

You can look at my other answers, which describe GridSearchCV in more detail:

  • Model help using Scikit-learn when using GridSearch
  • scikit-learn GridSearchCV with multiple repetitions
