Scikit-learn - feature reduction using RFECV and GridSearch. Where are the coefficients stored?

Question

I am using scikit-learn's RFECV to select the most significant features for a logistic regression, using cross-validation. Assume X is an [n, x] DataFrame of features and y is the response variable:

import numpy as np
from sklearn.pipeline import make_pipeline
# sklearn.grid_search and sklearn.cross_validation were removed from
# scikit-learn; their contents now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Create a logistic regression estimator
logreg = LogisticRegression()

# Use RFECV to pick the best features, using stratified k-fold
skf = StratifiedKFold(n_splits=3)
rfecv = RFECV(estimator=logreg, cv=skf, scoring='roc_auc')

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into a new array X_new
X_new = rfecv.transform(X)

# Build a pipeline that scales the features, then fits a logistic regression.
# solver='liblinear' is set because it supports both 'l1' and 'l2' penalties;
# the default solver in recent releases does not support 'l1'.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Define a range of hyperparameters for the grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']

param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)

grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')

grid.fit(X_new, y)

Two questions:

a) Is this the correct process for feature selection, hyperparameter selection, and fitting?

b) Where can I find the fitted coefficients for the selected features?

Recommended answer

Is this the correct process for feature selection? This is ONE of many ways of doing feature selection. Recursive feature elimination is an automated approach; others are listed in the scikit-learn documentation. They have different pros and cons, and feature selection usually works best when it also involves common sense and trying models with different feature sets. RFE is a quick way of selecting a good set of features, but it does not necessarily give you the best possible one.

By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold when the y values are binary or multiclass, which I assume is the case here since you are using LogisticRegression. You can also combine

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

into

X_new = rfecv.fit_transform(X, y)
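
As a rough check of the two points above, here is a minimal sketch. The data comes from make_classification, which is only a stand-in for the asker's X and y, and max_iter=1000 is an assumption added to avoid convergence warnings. It uses check_cv to show what an integer cv resolves to for a classifier, and compares the two-step fit/transform with the one-step fit_transform.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, check_cv

# Synthetic stand-in for the asker's X and y (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# For a classifier with a binary or multiclass y, an integer cv
# is resolved to a StratifiedKFold splitter.
resolved_cv = check_cv(3, y, classifier=True)
print(type(resolved_cv).__name__)

# Two-step version: fit, then transform.
rfecv_a = RFECV(estimator=LogisticRegression(max_iter=1000),
                cv=3, scoring="roc_auc")
rfecv_a.fit(X, y)
X_new_a = rfecv_a.transform(X)

# One-step version: fit_transform.
rfecv_b = RFECV(estimator=LogisticRegression(max_iter=1000),
                cv=3, scoring="roc_auc")
X_new_b = rfecv_b.fit_transform(X, y)

# Both routes keep the same columns.
print(X_new_a.shape, np.array_equal(X_new_a, X_new_b))
```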

Is this the correct process for hyperparameter selection? GridSearchCV is basically an automated way of systematically trying every combination in a set of model parameters and picking the best one according to some performance metric. So yes, it is a good way of finding well-suited parameters.
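
To make "every combination" concrete, the sketch below expands the question's param_grid with ParameterGrid, the helper scikit-learn provides for enumerating such grids. The fold count of 3 is taken from the question; nothing is fitted here, the code only counts candidates.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the question: 6 values of C times 2 penalties.
C_range = 10.0 ** np.arange(-5, 1)
penalty_options = ["l1", "l2"]
param_grid = {
    "logisticregression__C": C_range,
    "logisticregression__penalty": penalty_options,
}

# Enumerate the parameter combinations the grid search would try.
combos = list(ParameterGrid(param_grid))
n_folds = 3

print(len(combos))            # number of candidate settings
print(len(combos) * n_folds)  # cross-validation fits, before the final refit
for params in combos[:3]:
    print(params)
```

The cost grows multiplicatively with each added parameter, so it is worth keeping the ranges coarse at first and refining around the best region.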

Is this the correct process for fitting? Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it builds one candidate estimator per parameter combination in the grid (here, a copy of the StandardScaler + LogisticRegression pipeline for each) and fits each of them with cross-validation. It keeps the best-performing one under grid.best_estimator_, that estimator's parameters in grid.best_params_, and its cross-validated score in grid.best_score_. Note that grid.fit returns the GridSearchCV object itself, not the best estimator. Remember that any new incoming X values you want to predict on must first be transformed with the fitted RFECV model. For that reason, you can add the RFECV step to the pipeline as well.
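
One way to put the RFECV step inside the pipeline is sketched below. This is an illustration and not the asker's exact setup: the data is synthetic, X_incoming is a hypothetical name for new samples, max_iter=1000 is an added assumption, and the grid is reduced to a few values of C to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; X_incoming plays the role of new, unseen samples.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_incoming, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scaling, feature selection, and the final model all live in one pipeline.
pipe = make_pipeline(
    StandardScaler(),
    RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc"),
    LogisticRegression(max_iter=1000),
)

# make_pipeline names steps after the lowercased class name, so this key
# targets the final LogisticRegression step, not the one wrapped by RFECV.
param_grid = {"logisticregression__C": 10.0 ** np.arange(-2, 2)}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")

# fit returns the GridSearchCV object itself.
fitted = grid.fit(X_train, y_train)
print(fitted is grid)
print(grid.best_params_, grid.best_score_)

# New data passes through the fitted scaler and RFECV automatically.
proba = grid.predict_proba(X_incoming)[:, 1]
print(proba.shape)
```

With this layout the feature selection is re-run inside each cross-validation fold, so the reported score is not influenced by features chosen using the held-out data.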

Where can I find the fitted coefficients for the selected features? The grid.best_estimator_ attribute holds the refitted best model. Because the grid search here was run on a pipeline, that object is the fitted Pipeline, and the LogisticRegression step inside it carries the coefficients: grid.best_estimator_.named_steps['logisticregression'].coef_ (and .intercept_ for the intercept). If you pass a bare LogisticRegression to GridSearchCV instead, grid.best_estimator_.coef_ and grid.best_estimator_.intercept_ work directly. Note that grid.best_estimator_ is only available when the refit parameter of GridSearchCV is True, which is the default.
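
The sketch below follows the question's two-stage workflow and pairs each coefficient with the name of the feature RFECV kept, using the selector's support_ mask. The feature names (feat_0, feat_1, ...) and the synthetic data are made up for illustration, and the grid is cut down to three values of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)
feature_names = np.array([f"feat_{i}" for i in range(X.shape[1])])

# Stage 1: feature selection.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              cv=3, scoring="roc_auc")
X_new = rfecv.fit_transform(X, y)
selected = feature_names[rfecv.support_]  # boolean mask of kept columns

# Stage 2: grid search over a scaler + logistic regression pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1.0]},
                    cv=3, scoring="roc_auc")
grid.fit(X_new, y)

# best_estimator_ is the fitted Pipeline; the coefficients live on its
# LogisticRegression step.
best_logreg = grid.best_estimator_.named_steps["logisticregression"]
coef_by_feature = dict(zip(selected, best_logreg.coef_.ravel()))

for name, coef in coef_by_feature.items():
    print(f"{name}: {coef:.4f}")
print("intercept:", best_logreg.intercept_)
```

Because the pipeline standardizes the inputs, these coefficients are on the scaled features, not on the original units.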
