How to perform feature selection with gridsearchcv in sklearn in python

Problem description

I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[my_features]      # all my features (my_features is a list of column names)
y = df['gold_standard']  # labels

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])

I am also performing GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score

X = df[my_features]      # all my features
y = df['gold_standard']  # labels

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)

pred = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred))

However, I am not clear how to merge feature selection (RFECV) with GridSearchCV.

When I run the answer suggested by @Gambit, I get the following error:

ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators='warn', n_jobs=None, oob_score=False,
            random_state=42, verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
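In essence, the failing setup searches RandomForestClassifier parameter names on the RFECV wrapper, something like the following sketch (a reconstruction for illustration, not the exact code that was run):

# Reconstruction of the failing attempt: the grid keys name
# RandomForestClassifier parameters, but the estimator being searched
# is the RFECV wrapper, which has no 'criterion' parameter itself.
rfecv = RFECV(estimator=rfc, step=1, cv=k_fold, scoring='roc_auc')
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid,
                      cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)  # raises: ValueError: Invalid parameter criterion ...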

I could resolve the above issue by using the estimator__ prefix in the param_grid parameter list.
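That works because of sklearn's nested-parameter convention: prefixing a key with estimator__ routes it to the estimator wrapped inside RFECV. The grid from above then becomes:

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}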

My question now is: how do I use the selected features and tuned parameters on x_test to verify that the model works on unseen data? How can I obtain the best features and train with the optimal hyperparameters?

I am happy to provide more details if needed.

Recommended answer

Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (with cross-validation).

The Pipeline object is meant exactly for this purpose: assembling the data transformation and applying the estimator.

You could even use a different model (GradientBoostingClassifier, etc.) for the final classification. It is possible with the following approach:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# classifier used only for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42,
                                        class_weight="balanced")
rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring='roc_auc')

# you can use a different classifier for the final classification
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},
                      cv=5, scoring='roc_auc')

pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

Now, you can apply this pipeline (including feature selection) to the test data.
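To address the follow-up question directly: the fitted pipeline carries both the selected features and the tuned hyperparameters, and because its final step supports predict_proba you can score it on the held-out data. A minimal sketch using the step names from the pipeline above (load_breast_cancer returns plain arrays, so feature indices are printed rather than column names):

import numpy as np
from sklearn.metrics import roc_auc_score

# ROC AUC on unseen data; RFECV reduces X_test to the selected
# features before the tuned classifier predicts
probs = pipeline.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))

# indices of the features RFECV kept (boolean mask in support_)
support = pipeline.named_steps['feature_sele'].support_
print(np.where(support)[0])

# best hyperparameters found by the inner grid search
print(pipeline.named_steps['clf_cv'].best_params_)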
