Does GridSearchCV perform cross-validation?


Question

I'm currently working on a problem that compares the performance of three different machine learning algorithms on the same data set. I split the data set into 70/30 training/testing sets and then ran a grid search for the best parameters of each algorithm using GridSearchCV with X_train, y_train.

First question: am I supposed to perform the grid search on the training set, or on the whole data set?

Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare with GridSearchCV?

Any answer would be appreciated, thank you.

Recommended answer

All estimators in scikit-learn whose names end with CV perform cross-validation. But you still need to keep a separate test set for measuring performance.
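One way to see the cross-validation happening is to look at the per-fold scores that GridSearchCV records. The following is a minimal sketch, not the asker's setup: the iris data, the KNeighborsClassifier, and the small parameter grid are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data only; the question does not name its data set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# cv=3 requests 3-fold cross-validation on the data passed to fit().
# The grid below is a hypothetical example.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5]}, cv=3)
search.fit(X_train, y_train)

# cv_results_ holds one score column per fold for every parameter setting.
split_keys = sorted(
    key
    for key in search.cv_results_
    if key.startswith("split") and key.endswith("_test_score")
)
print(split_keys)
```

Each of the three `splitN_test_score` entries holds the validation score of every parameter candidate on one fold, so the search itself is the cross-validation step.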

So you need to split your whole data set into training and test sets. Forget about the test data for a while.

Then pass only the training data to the grid search. GridSearchCV will split this training data further into training and validation folds to tune the hyper-parameters passed to it. Finally, by default, it refits the model on the whole training data with the best parameters found.
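The steps so far can be sketched as follows. This is an illustration under stated assumptions rather than the asker's code: the iris data, the SVC estimator, and the parameter grid are placeholders, while the 70/30 split follows the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Illustrative data only; the question does not name its data set.
X, y = load_iris(return_X_y=True)

# 70/30 split, as in the question. The test part is set aside untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical parameter grid, chosen only for illustration.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# cv=5 asks for 5-fold cross-validation inside the training data.
# refit=True (the default) refits the best setting on all of X_train.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated score:", search.best_score_)
```

Note that `best_score_` is the mean validation score across the folds of the training data. It was used to pick the parameters, so it is not a substitute for a score on the held-out test set.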

Now test this model on the test data you set aside in the beginning. This gives you an estimate close to the model's real-world performance.
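For the three-algorithm comparison in the question, a sketch along these lines may help. The three estimators, their grids, and the iris data are assumptions made for illustration; the question does not say which algorithms or data were used. Each search sees only X_train, and the shared test set is used once per tuned model.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative data only; the question does not name its data set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical candidates: (estimator, parameter grid) per algorithm.
candidates = {
    "svc": (SVC(), {"C": [0.1, 1, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

test_scores = {}
for name, (estimator, grid) in candidates.items():
    # Tune on the training data only, with 5-fold cross-validation.
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    # score() evaluates the refitted best estimator on the held-out data.
    test_scores[name] = search.score(X_test, y_test)

for name, score in test_scores.items():
    print(f"{name}: held-out accuracy = {score:.3f}")
```

Using the same X_train and X_test for all three searches keeps the comparison fair, since every algorithm is tuned and evaluated on identical data.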

If you pass the whole data set to GridSearchCV, test data leaks into the parameter tuning, and the final model may not perform as well on new, unseen data.

You can look at my other answers, which describe GridSearchCV in more detail.
