How to perform GridSearchCV with cross validation in Python
Problem description
I am performing hyperparameter tuning of RandomForest as follows using GridSearchCV.
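import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.array(df[features])  # all features
y = np.array(df['gold_standard'])  # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# base estimator to tune; its definition was not shown in the post, so this line is an assumption
rfc = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [200, 500],
    # newer scikit-learn releases no longer accept 'auto' (it behaved like 'sqrt' for classifiers)
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy'],
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y_train)  # 5-fold cross-validation on the training split only
print(CV_rfc.best_params_)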
The result I got is as follows.
{'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'n_estimators': 200}
Afterwards, I reapply the tuned parameters to x_test as follows.
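from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# refit on the training split with the tuned parameters
rfc = RandomForestClassifier(random_state=42, criterion='gini', max_depth=6, max_features='auto', n_estimators=200, class_weight='balanced')
rfc.fit(x_train, y_train)
pred = rfc.predict(x_test)  # hard class labels for the held-out test split
print(precision_recall_fscore_support(y_test, pred))
print(roc_auc_score(y_test, pred))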
However, I am still not clear how to use GridSearchCV with 10-fold cross validation (i.e. not just apply the tuned parameters to x_test), i.e. something like below.
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
OR

since GridSearchCV uses cross-validation, can we use all X and y and get the best result as the final result?
I am happy to provide more details if needed.
You should not perform a grid search in this scenario.
Internally, GridSearchCV splits the dataset given to it into various training and validation subsets, and, using the hyperparameter grid provided to it, finds the single set of hyperparameters that gives the best score on the validation subsets.
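To make that internal process concrete, here is a rough sketch that does the same search by hand and then compares it with GridSearchCV. The synthetic data from make_classification, the small grid, and the tiny forest are assumptions chosen only so the example runs quickly; they are not the asker's data or settings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     StratifiedKFold, cross_val_score)

# Synthetic stand-in data (assumption: the real X, y come from the asker's df)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rfc = RandomForestClassifier(n_estimators=20, random_state=42)
param_grid = {"max_depth": [2, 4], "criterion": ["gini", "entropy"]}
cv = StratifiedKFold(n_splits=5)  # same splits are reused for every candidate

# Manual search: score every grid point by its mean validation score
manual_best_score, manual_best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    model = clone(rfc).set_params(**params)
    mean_score = cross_val_score(model, X, y, cv=cv).mean()
    if mean_score > manual_best_score:
        manual_best_score, manual_best_params = mean_score, params

# The same search delegated to GridSearchCV
search = GridSearchCV(rfc, param_grid, cv=cv)
search.fit(X, y)

print("manual:      ", manual_best_params, manual_best_score)
print("GridSearchCV:", search.best_params_, search.best_score_)
```

Either way, the outcome is one winning hyperparameter set for the data that was passed in, which is why the search itself already counts as cross-validation.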
The point of a train-test split is then, after this process is done, to perform one final scoring on the test data, which has so far been unknown to the model, to see if your hyperparameters have been overfit to the validation subsets. If it does well, then the next step is putting the model into production/deployment.
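A minimal sketch of that workflow, assuming synthetic data and a deliberately small grid (both are placeholders, not the asker's setup): tune on the training split only, then score the refitted best model once on the untouched test split. It relies on GridSearchCV's default refit=True, which retrains the best configuration on all of the training data passed to fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    cv=5,
    scoring="roc_auc",
)
search.fit(x_train, y_train)  # the test split is never seen here

# Mean validation AUC of the winning configuration
cv_auc = search.best_score_
# One final check on held-out data, using class-1 probabilities
test_auc = roc_auc_score(y_test, search.predict_proba(x_test)[:, 1])

print("best params:", search.best_params_)
print("validation AUC:", cv_auc)
print("test AUC:", test_auc)
```

A test score far below the validation score would suggest the hyperparameters were overfit to the validation subsets.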
If you perform a grid search within cross-validation, then you will have multiple sets of hyperparameters, each of which did the best on their grid-search validation sub-subset of the cross-validation split. You cannot combine these sets into a single coherent hyperparameter specification, and therefore you cannot deploy your model.
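The following sketch shows what that looks like in practice: a grid search run inside each fold of a 10-fold split records one winning hyperparameter set per fold. The data, grid, and forest size are assumptions picked for speed. The per-fold winners may or may not agree, and the held-out scores estimate how well the whole tuning procedure works rather than describing one deployable model.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

outer_cv = StratifiedKFold(n_splits=10)
inner_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    cv=3,
)

per_fold_params, per_fold_scores = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    search = clone(inner_search)            # fresh, unfitted search per fold
    search.fit(X[train_idx], y[train_idx])  # tune on this fold's training part
    per_fold_params.append(search.best_params_)
    # accuracy of the refitted best model on this fold's held-out part
    per_fold_scores.append(search.score(X[test_idx], y[test_idx]))

for fold, (params, score) in enumerate(zip(per_fold_params, per_fold_scores), 1):
    print(fold, params, round(score, 3))
```

To end up with a single model, the usual route is the one described above: run one grid search on the training data and keep its best_params_.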