How to extract important features after k-fold cross validation, with or without a pipeline?

Problem description

I want to build a classifier that uses cross-validation, and then extract the important features (/coefficients) from each fold so I can look at their stability. At the moment I am using cross_validate and a pipeline. I want to use a pipeline so that feature selection and standardization happen within each fold. I'm stuck on how to extract the features from each fold. I've included an alternative to using a pipeline below, in case the pipeline itself is the problem.

This is my code so far (I want to try SVM and logistic regression). I've included a small df as an example:

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import pandas as pd

df = pd.DataFrame({'length': [5, 8, 0.2, 10, 25, 3.2], 
                   'width': [60, 102, 80.5, 30, 52, 81],
                   'group': [1, 0, 0, 0, 1, 1]})

array = df.values
y = array[:,2]
X = array[:,0:2]

select = SelectKBest(mutual_info_classif, k=2)
scl = StandardScaler()
svm = SVC(kernel='linear', probability=True, random_state=42)
logr = LogisticRegression(random_state=42)

pipeline = Pipeline([('select', select), ('scale', scl), ('svm', svm)])

split = KFold(n_splits=2, shuffle=True, random_state=42)

output = cross_validate(pipeline, X, y, cv=split, 
                scoring = ('accuracy', 'f1', 'roc_auc'),
                return_estimator = True,
                return_train_score= True)

I thought I could do something like:

pipeline.named_steps['svm'].coef_

And got the error message:

AttributeError: 'SVC' object has no attribute 'dual_coef_'

If it's not possible to do this using a pipeline, could I do it using 'by hand' cross validation? e.g.:

for train_index, test_index in kfold.split(X, y):
    kfoldtx = [X[i] for i in train_index]
    kfoldty = [y[i] for i in train_index]

But I'm not sure what to do next! Any help would be very appreciated.
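For reference, one way the manual loop could continue (a sketch, not the accepted approach below): clone the pipeline with sklearn.base.clone so each fold gets a fresh, independent copy, fit it on the training indices, and read coef_ from the clone. Two adjustments of mine: NumPy fancy indexing replaces the list comprehensions, and StratifiedKFold replaces KFold, because a plain 2-fold split of only six rows can leave a training fold with a single class.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.DataFrame({'length': [5, 8, 0.2, 10, 25, 3.2],
                   'width': [60, 102, 80.5, 30, 52, 81],
                   'group': [1, 0, 0, 0, 1, 1]})
X, y = df[['length', 'width']].values, df['group'].values

pipeline = Pipeline([('select', SelectKBest(mutual_info_classif, k=2)),
                     ('scale', StandardScaler()),
                     ('svm', SVC(kernel='linear', random_state=42))])

kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
coefs = []
for train_index, test_index in kfold.split(X, y):
    fold_pipe = clone(pipeline)                    # fresh, unfitted copy for this fold
    fold_pipe.fit(X[train_index], y[train_index])  # selection + scaling fit on the fold only
    coefs.append(fold_pipe.named_steps['svm'].coef_.ravel())

print(np.vstack(coefs))  # one row of SVM coefficients per fold
```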

Recommended answer

You should use the output of cross_validate to get the parameters of the fitted models. The reason is that cross_validate clones the pipeline, so the pipeline variable you passed in will not be fitted after being fed to cross_validate.
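To see that cloning in action, here is a minimal sketch on the question's toy data (SelectKBest omitted for brevity, and StratifiedKFold used so each tiny training fold contains both classes): the original pipeline object stays unfitted, while the estimators returned by cross_validate are fitted copies.

```python
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted

df = pd.DataFrame({'length': [5, 8, 0.2, 10, 25, 3.2],
                   'width': [60, 102, 80.5, 30, 52, 81],
                   'group': [1, 0, 0, 0, 1, 1]})
X, y = df[['length', 'width']].values, df['group'].values

pipeline = Pipeline([('scale', StandardScaler()),
                     ('svm', SVC(kernel='linear', random_state=42))])
output = cross_validate(pipeline, X, y,
                        cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                        return_estimator=True)

try:
    check_is_fitted(pipeline.named_steps['svm'])
except NotFittedError:
    print("original pipeline was never fitted")   # this branch runs: only clones were fitted

check_is_fitted(output['estimator'][0].named_steps['svm'])  # passes: the clone is fitted
```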

output is a dictionary that has estimator as one of its keys; its value is a list of k_fold fitted pipeline objects.

From the documentation:


return_estimator : boolean, default False

Whether to return the estimators fitted on each split.

Try it out!

>>> fitted_svc = output['estimator'][0].named_steps['svm']  # the pipeline fitted on the first fold
>>> fitted_svc.coef_

array([[1.05826838, 0.41630046]])
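Since the goal is to look at stability, the same lookup can be run over every fitted pipeline in output['estimator']. A self-contained sketch of that loop, with two adjustments of mine: StratifiedKFold instead of KFold (with only six rows a plain 2-fold training split can end up single-class), and probability=True dropped (coef_ does not need it). Because SelectKBest is refit inside each fold, named_steps['select'].get_support() maps each coefficient back to its original column; here k=2 keeps both columns, but on real data the selected set can differ between folds.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.DataFrame({'length': [5, 8, 0.2, 10, 25, 3.2],
                   'width': [60, 102, 80.5, 30, 52, 81],
                   'group': [1, 0, 0, 0, 1, 1]})
X, y = df[['length', 'width']].values, df['group'].values

pipeline = Pipeline([('select', SelectKBest(mutual_info_classif, k=2)),
                     ('scale', StandardScaler()),
                     ('svm', SVC(kernel='linear', random_state=42))])
output = cross_validate(pipeline, X, y,
                        cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                        return_estimator=True)

feature_names = np.array(['length', 'width'])
for i, fitted in enumerate(output['estimator']):
    mask = fitted.named_steps['select'].get_support()  # columns kept by SelectKBest in this fold
    coefs = fitted.named_steps['svm'].coef_.ravel()    # one weight per selected column
    print(f"fold {i}:", dict(zip(feature_names[mask], coefs)))
```

Comparing the per-fold dictionaries then shows directly how much each feature's coefficient moves between folds.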
