How to perform GridSearchCV with cross validation in Python

Problem description

I am performing hyperparameter tuning of RandomForest as follows using GridSearchCV.

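import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.array(df[features])  # all features
y = np.array(df['gold_standard'])  # labels

# Hold out 30% of the data as a test set that the grid search never sees
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# rfc was not defined in the original snippet; this definition is an assumption
rfc = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [200, 500],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)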

The result I got is as follows.

{'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'n_estimators': 200}

Afterwards, I reapply the tuned parameters to x_test as follows.

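from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Refit on the training split with the tuned hyperparameters
rfc = RandomForestClassifier(random_state=42, criterion='gini', max_depth=6, max_features='auto', n_estimators=200, class_weight='balanced')
rfc.fit(x_train, y_train)
pred = rfc.predict(x_test)
print(precision_recall_fscore_support(y_test, pred))
# This scores hard class labels; if the labels are binary, rfc.predict_proba(x_test)[:, 1] gives a threshold-independent AUC
print(roc_auc_score(y_test, pred))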

However, I am still not clear on how to use GridSearchCV with 10-fold cross-validation (i.e., not just apply the tuned parameters to x_test), for example something like the following.

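from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)
# Each iteration yields the index arrays for one train/test fold
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]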

OR

Since GridSearchCV uses cross-validation, can we use all of X and y and take the best result as the final result?

I am happy to provide more details if needed.

Solution

You should not perform a grid search in this scenario.

Internally, GridSearchCV splits the dataset given to it into various training and validation subsets, and, using the hyperparameter grid provided to it, finds the single set of hyperparameters that gives the best score on the validation subsets.
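
To make the internal splitting visible, here is a minimal sketch that inspects cv_results_ after a small grid search. The synthetic data from make_classification and the reduced grid are assumptions chosen so the example runs quickly; they are not the data or grid from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the question's data (an assumption, not the real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately small grid: 2 x 2 = 4 candidate hyperparameter sets
param_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
)
search.fit(X, y)

# cv_results_ holds one entry per candidate and one score column per validation fold
results = search.cv_results_
n_candidates = len(results["params"])
fold_columns = [
    key for key in results
    if key.startswith("split") and key.endswith("_test_score")
]

print(n_candidates, "candidates x", len(fold_columns), "validation folds")
print("best:", search.best_params_)
print("mean validation score of best:", search.best_score_)
```

Every candidate is scored on each of the five validation folds, and best_params_ is the candidate with the highest mean of those scores.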

The point of a train-test split is then, after this process is done, to perform one final scoring on the test data, which has so far been unknown to the model, to see if your hyperparameters have been overfit to the validation subsets. If it does well, then the next step is putting the model into production/deployment.
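
The workflow described above might look like the following sketch: tune on the training split only, then score once on the held-out test split. The synthetic data, the small grid, and the choice of scoring="roc_auc" are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the question's dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 4], "n_estimators": [10, 20]},
    cv=5,
    scoring="roc_auc",
)
# With the default refit=True, the best setting is retrained on all of x_train
search.fit(x_train, y_train)

# The refitted model has never seen x_test, so this is the single final check
test_proba = search.best_estimator_.predict_proba(x_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)

print("mean validation AUC:", search.best_score_)
print("held-out test AUC:", test_auc)
```

A test score well below the validation score would suggest the hyperparameters were overfit to the validation subsets.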

If you perform a grid search within cross-validation, then you will have multiple sets of hyperparameters, each of which did the best on their grid-search validation sub-subset of the cross-validation split. You cannot combine these sets into a single coherent hyperparameter specification, and therefore you cannot deploy your model.
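
The situation described above can be reproduced with a short sketch that runs a grid search inside each fold of an outer 10-fold split and collects the winning hyperparameters per fold. The synthetic data and the small grid are assumptions; whether the per-fold winners actually differ depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the question's data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 4], "n_estimators": [10, 20]},
    cv=5,
)
outer_cv = StratifiedKFold(n_splits=10)

# Each outer fold runs its own independent grid search on that fold's training part
outer = cross_validate(inner_search, X, y, cv=outer_cv, return_estimator=True)

# One fitted GridSearchCV per outer fold, each with its own best_params_
per_fold_params = [est.best_params_ for est in outer["estimator"]]
for fold, (params, score) in enumerate(zip(per_fold_params, outer["test_score"]), 1):
    print(fold, params, round(score, 3))
```

The result is ten hyperparameter sets, one per outer fold, and nothing in this procedure selects a single one of them for deployment.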
