Scikit-learn - feature reduction using RFECV and GridSearch. Where are the coefficients stored?
Question
I am using scikit-learn's RFECV to select the most significant features for a logistic regression, using cross-validation. Assume X is an [n, x] DataFrame of features and y is the response variable:
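import numpy as np  # needed for np.arange below
from sklearn.pipeline import make_pipeline
# Note: sklearn.grid_search and sklearn.cross_validation are legacy module paths;
# from scikit-learn 0.18 onward these classes live in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm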
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
# Build a pipeline that standardizes the features, then fits a logistic regression
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())
# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']
skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)
grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')
grid.fit(X_new, y)
Two questions:
a) Is this the correct process for feature selection, hyper-parameter selection, and fitting?
b) Where can I find the fitted coefficients for the selected features?
Recommended answer
Is this the correct process for feature selection?
This is one of many ways to do feature selection. Recursive feature elimination is an automated approach; others are listed in the scikit-learn documentation. They have different pros and cons, and feature selection usually works best when you also apply common sense and try models with different feature sets. RFE is a quick way of selecting a good set of features, but it does not necessarily give you the best possible set. By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold when the y values are binary or multiclass, which I assume is the case since you are using LogisticRegression.
You can also combine
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
into
X_new = rfecv.fit_transform(X, y)
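The two suggestions above (passing cv=3 and using fit_transform) can be sketched together as follows. This is only an illustration: it assumes a recent scikit-learn release where the cross-validation helpers live in sklearn.model_selection, and it uses synthetic data from make_classification as a stand-in for the asker's X and y.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, check_cv

# Synthetic stand-in for the asker's X and y (hypothetical data).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# An integer cv resolves to StratifiedKFold for a classifier with
# binary or multiclass targets; check_cv shows what gets chosen.
splitter = check_cv(3, y, classifier=True)
print(type(splitter).__name__)

# Pass cv=3 directly instead of building the splitter by hand.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3,
              scoring="roc_auc")

# Fit the selector and reduce X in a single call.
X_new = rfecv.fit_transform(X, y)

print(X_new.shape)      # (n_samples, number of features kept)
print(rfecv.support_)   # boolean mask over the original columns
```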
Is this the correct process for hyper-parameter selection?
GridSearchCV is basically an automated way of systematically trying every combination in a set of model parameters and picking the best one according to some performance metric. So yes, it is a good way of finding well-suited parameters.
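To make "trying every combination" concrete, here is a small sketch that lists the candidate parameter sets and the cross-validated score GridSearchCV records for each. It assumes a recent scikit-learn release and synthetic data; solver="liblinear" is my own choice here, made because that solver accepts both the 'l1' and 'l2' penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"C": 10.0 ** np.arange(-3, 2), "penalty": ["l1", "l2"]}

# ParameterGrid enumerates the same combinations GridSearchCV will try:
# 5 values of C times 2 penalties.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))

grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                    cv=3, scoring="roc_auc")
grid.fit(X, y)

# One mean cross-validated score per candidate.
results = grid.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(score), 3))

print("best:", grid.best_params_, round(float(grid.best_score_), 3))
```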
Is this the correct process for fitting?
Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it builds a grid of candidate estimators (here the scaler plus LogisticRegression pipeline, one per combination of the parameters being tried) and fits each of them. It keeps the best-performing one in grid.best_estimator_, that estimator's parameters in grid.best_params_, and its cross-validated score in grid.best_score_. Note that fit returns the GridSearchCV object itself, not the best estimator. Remember that any new X values you want to predict on must first be transformed with the fitted RFECV model. For that reason, you can add the RFECV step to the pipeline as well.
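One possible way to put the RFECV step inside the pipeline is sketched below, so that new rows go through scaling and feature selection automatically at predict time. The step names ("scale", "select", "clf"), the synthetic data, and the choice to scale before selecting are all assumptions made for illustration; the question's code selects features on the unscaled data first.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Feature selection is refit inside every grid-search fold.
    ("select", RFECV(LogisticRegression(max_iter=1000), cv=3,
                     scoring="roc_auc")),
    ("clf", LogisticRegression(solver="liblinear")),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0], "clf__penalty": ["l1", "l2"]}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
grid.fit(X, y)

# New rows need no manual transform: the fitted pipeline applies
# scaling and the RFECV mask before predicting.
X_incoming = X[:5]
preds = grid.predict(X_incoming)
n_kept = grid.best_estimator_.named_steps["select"].n_features_
print(preds, n_kept)
```

A side effect of this layout is that feature selection is cross-validated together with the hyper-parameters, at the cost of a longer search.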
Where can I find the fitted coefficients for the selected features?
The grid.best_estimator_ attribute holds the refitted best model. If you grid-search a bare LogisticRegression, grid.best_estimator_.coef_ has all the coefficients and grid.best_estimator_.intercept_ is the intercept. In your code the grid wraps a pipeline, so grid.best_estimator_ is the Pipeline and the LogisticRegression sits inside it: use grid.best_estimator_.named_steps['logisticregression'].coef_ (and .intercept_). Note that grid.best_estimator_ is only available when the refit parameter on GridSearchCV is True, but that is the default anyway.
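When the grid wraps a pipeline built with make_pipeline, as in the question, the fitted LogisticRegression is reached through named_steps. The sketch below also pairs each coefficient with the feature it belongs to, using the RFECV support mask. The feature names and data are made up for illustration, and the code assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

# Step 1: feature selection, as in the question.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc")
X_new = rfecv.fit_transform(X, y)
selected = feature_names[rfecv.support_]

# Step 2: grid search over a scaler + logistic regression pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]},
                    cv=3, scoring="roc_auc")
grid.fit(X_new, y)

# best_estimator_ is the Pipeline; make_pipeline names each step after
# its lowercased class name.
logreg = grid.best_estimator_.named_steps["logisticregression"]
coef_by_feature = {str(name): float(c)
                   for name, c in zip(selected, logreg.coef_.ravel())}
print(coef_by_feature)
print(logreg.intercept_)
```

Because the pipeline standardizes the inputs, these coefficients are on the scale of the standardized features, not the raw ones.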