How to perform feature selection with GridSearchCV in sklearn in Python
Question
I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.
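from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[[my_features]]  # all my features
y = df['gold_standard']  # labels
clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
# support_ is a boolean mask over the columns of X
features = list(X.columns[rfecv.support_])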
I am also running GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X = df[[my_features]]  # all my features
y = df['gold_standard']  # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    # 'auto' is only accepted by older scikit-learn releases; drop it on newer ones
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
# probability of the positive class on the held-out split
pred = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred))
However, I am not clear on how to combine feature selection (RFECV) with GridSearchCV.
When I run the answer suggested by @Gambit, I get the following error:
ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None, oob_score=False,
random_state=42, verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I could resolve the above issue by prefixing the parameter names in the param_grid parameter list with estimator__.
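For reference, a minimal sketch of what that fix might look like. RFECV holds the classifier under its `estimator` parameter, so GridSearchCV can only reach the forest's hyperparameters through names such as `estimator__max_depth`. The dataset (`load_breast_cancer`), the small grid, `step=5`, and the reduced fold counts are assumptions chosen to keep the example quick; they are not from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Assumed stand-in data; replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=rfc, step=5, cv=StratifiedKFold(3), scoring="roc_auc")

# The grid is searched over RFECV, so each forest hyperparameter
# is addressed through the "estimator__" prefix.
param_grid = {
    "estimator__max_depth": [2, 4],
    "estimator__criterion": ["gini", "entropy"],
}

k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
CV_rfecv = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring="roc_auc")
CV_rfecv.fit(x_train, y_train)

# best_estimator_ is an RFECV refit on all of x_train with the best parameters.
best_rfecv = CV_rfecv.best_estimator_
print(CV_rfecv.best_params_)
print("Number of selected features:", best_rfecv.n_features_)

# predict_proba applies the feature mask before calling the forest.
pred = CV_rfecv.predict_proba(x_test)[:, 1]
test_auc = roc_auc_score(y_test, pred)
print("Test ROC AUC: %.3f" % test_auc)
```

With this layout, feature selection is repeated for every hyperparameter candidate, which is thorough but slow on larger grids.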
My question now is how to apply the selected features and parameters to x_test to verify that the model works well on unseen data. How can I obtain the best features and train the model on them with the optimal hyperparameters?
I am happy to provide more details if needed.
Recommended answer
Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (with cross-validation).
The Pipeline object is meant for exactly this purpose: chaining the data transformations and then applying an estimator.
Maybe you could use a different model (GradientBoostingClassifier, etc.) for your final classification. That is possible with the following approach:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42,
                                        class_weight="balanced")
rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring='roc_auc')

# you can have a different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},
                      cv=5, scoring='roc_auc')

pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)
Now you can apply this pipeline (including feature selection) to the test data.
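To see which features were kept and which hyperparameters won, one option is to read them back from the fitted pipeline steps. The sketch below rebuilds the same pipeline so that it runs on its own. The names `selector`, `search`, and `test_auc` are illustrative, and scoring the held-out split with `roc_auc_score` is an assumption that mirrors the metric used in the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Same two steps as the pipeline above: RFECV for selection, GridSearchCV for tuning.
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=42, class_weight="balanced"),
    step=1, cv=5, scoring="roc_auc",
)
CV_rfc = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced"),
    param_grid={"max_depth": [2, 3]}, cv=5, scoring="roc_auc",
)
pipeline = Pipeline([("feature_sele", rfecv), ("clf_cv", CV_rfc)])
pipeline.fit(X_train, y_train)

# Fitted steps are available by name.
selector = pipeline.named_steps["feature_sele"]
search = pipeline.named_steps["clf_cv"]

# support_ is a boolean mask over the input columns.
selected = list(data.feature_names[selector.support_])
print("Number of selected features:", selector.n_features_)
print("Selected features:", selected)
print("Best hyperparameters:", search.best_params_)

# The pipeline applies the feature mask to X_test before predicting.
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print("Test ROC AUC: %.3f" % test_auc)
```

Because the test split is never seen during selection or tuning, the final score is an estimate of performance on unseen data.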