Getting features in RFECV scikit-learn


Question

Inspired by this: http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py

I am wondering if there is any way to get the features for a particular score:

In that case, I would like to know which 10 selected features give the peak when #Features = 10.

Any ideas?

EDIT:

This is the code used to get that plot:

from sklearn import svm
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold  # for k-fold cross validation
import matplotlib.pyplot as plt

# X is a pandas DataFrame of features, y the class labels (defined elsewhere).
# The "accuracy" scoring is proportional to the number of correct classifications.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # split the data into 10 stratified folds
model_Linear_SVM = svm.SVC(kernel='linear', probability=True)
rfecv = RFECV(estimator=model_Linear_SVM, step=1, cv=kfold, scoring='accuracy')  # 10-fold cross-validation
rfecv = rfecv.fit(X, y)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X.columns[rfecv.support_])
print('Original features :', X.columns)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score \n of number of selected features")
# grid_scores_ was removed in scikit-learn 1.2; use rfecv.cv_results_['mean_test_score'] there
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Solution

First, you can see which features were selected at the point where the cross-validation score is largest (in your case that corresponds to 17 or 21 features; it is hard to tell exactly from the figure) with

rfecv.support_

or

rfecv.ranking_ 
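If the peak coincides with the optimum RFECV picked, rfecv.support_ already answers the question directly. For any other point on the curve, here is a minimal sketch (assuming step=1 and the fitted rfecv and DataFrame X from the code above; features_at is a hypothetical helper) that reconstructs the size-k candidate set from the elimination order encoded in ranking_. Note it reflects the final full-data elimination path, not the per-fold subsets that actually produced the CV scores:

# Features at the peak itself (what RFECV kept):
print(X.columns[rfecv.support_])

# Hypothetical helper: rank 1 marks the n_features_ survivors, rank 2 the
# feature eliminated last, rank 3 the one before that, and so on (step=1),
# so the size-k subset is everything ranked at most k - n_features_ + 1.
def features_at(k):
    if k < rfecv.n_features_:
        raise ValueError("subsets smaller than n_features_ are not recorded in ranking_")
    return X.columns[rfecv.ranking_ <= k - rfecv.n_features_ + 1]

print(features_at(10))  # the 10 features that survived longest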

Then you can calculate the importances of the selected features (i.e. those at the peak of the CV score curve) with

np.absolute(rfecv.estimator_.coef_)

for simple estimators or

rfecv.estimator_.feature_importances_ 

if your estimator is an ensemble, such as a random forest.
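For example, with the linear SVM from the question you can turn the final refitted estimator's weights into a ranked importance table. A minimal sketch, assuming a binary classification problem (so coef_ has a single row) and the fitted rfecv from above:

import numpy as np
import pandas as pd

selected = X.columns[rfecv.support_]

# Linear estimator: one weight per selected feature; larger magnitude = more important.
importances = np.absolute(rfecv.estimator_.coef_).ravel()

# An ensemble such as RandomForestClassifier would instead expose:
# importances = rfecv.estimator_.feature_importances_

print(pd.Series(importances, index=selected).sort_values(ascending=False))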

Then you can remove the least important feature one by one in a loop, recalculating RFECV for the remaining feature set each time.
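A minimal sketch of such a loop, under the same assumptions as above (linear SVM, binary problem, X a DataFrame, and the kfold splitter from the question); the stopping rule and what you record on each pass are up to you:

import numpy as np

scores = {}  # number of candidate columns -> best CV score for that run
X_curr = X.copy()
while X_curr.shape[1] > 1:
    rfecv = RFECV(estimator=svm.SVC(kernel='linear'), step=1,
                  cv=kfold, scoring='accuracy').fit(X_curr, y)
    scores[X_curr.shape[1]] = max(rfecv.grid_scores_)  # cv_results_['mean_test_score'] in newer scikit-learn

    # Drop the selected feature with the smallest absolute weight,
    # then rerun RFECV on the remaining columns.
    selected = X_curr.columns[rfecv.support_]
    weakest = selected[np.argmin(np.absolute(rfecv.estimator_.coef_).ravel())]
    X_curr = X_curr.drop(columns=[weakest])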
