Scikit-Learn RFECV number of features based on grid scores only


Problem Description

From the scikit-learn RFE documentation, successively smaller sets of features are selected by the algorithm and only the features with the highest weights are preserved. Features with low weights are dropped and this process repeats itself until the number of features remaining matches that specified by the user (or is taken to be half of the original number of features by default).
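The pruning loop described above can be sketched directly. This is an illustrative re-implementation, not scikit-learn's actual code; the dataset and estimator are arbitrary choices:

```python
# Illustrative sketch of plain RFE: repeatedly fit the estimator,
# rank features by their weights, and drop the single weakest one
# until half of the original features remain (the RFE default).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
keep = np.arange(X.shape[1])   # indices of the surviving features
target = X.shape[1] // 2       # default: half the original feature count

while len(keep) > target:
    svc = SVC(kernel="linear").fit(X[:, keep], y)
    weights = np.abs(svc.coef_).sum(axis=0)      # one weight per feature
    keep = np.delete(keep, np.argmin(weights))   # drop the weakest feature

print("Surviving features:", sorted(keep.tolist()))
```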

The RFECV docs indicate that the features are ranked with RFE and KFCV.

Here is the RFECV example from the docs:

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV, RFE
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2), scoring='accuracy')
rfecv.fit(X, y)
rfe = RFE(estimator=svc, step=1)
rfe.fit(X, y)

print('Original number of features is %s' % X.shape[1])
print("RFE final number of features : %d" % rfe.n_features_)
print("RFECV final number of features : %d" % rfecv.n_features_)
print('')

import numpy as np
g_scores = rfecv.grid_scores_
indices = np.argsort(g_scores)[::-1]
print('Printing RFECV results:')
for f in range(X.shape[1]):
    print("%d. Number of features: %d; Grid_Score: %f"
          % (f + 1, indices[f] + 1, g_scores[indices[f]]))

Here is the output I get:

Original number of features is 25
RFE final number of features : 12
RFECV final number of features : 3

Printing RFECV results:
1. Number of features: 3; Grid_Score: 0.818041
2. Number of features: 4; Grid_Score: 0.816065
3. Number of features: 5; Grid_Score: 0.816053
4. Number of features: 6; Grid_Score: 0.799107
5. Number of features: 7; Grid_Score: 0.797047
6. Number of features: 8; Grid_Score: 0.783034
7. Number of features: 10; Grid_Score: 0.783022
8. Number of features: 9; Grid_Score: 0.781992
9. Number of features: 11; Grid_Score: 0.778028
10. Number of features: 12; Grid_Score: 0.774052
11. Number of features: 14; Grid_Score: 0.762015
12. Number of features: 13; Grid_Score: 0.760075
13. Number of features: 15; Grid_Score: 0.752003
14. Number of features: 16; Grid_Score: 0.750015
15. Number of features: 18; Grid_Score: 0.750003
16. Number of features: 22; Grid_Score: 0.748039
17. Number of features: 17; Grid_Score: 0.746003
18. Number of features: 19; Grid_Score: 0.739105
19. Number of features: 20; Grid_Score: 0.739021
20. Number of features: 21; Grid_Score: 0.738003
21. Number of features: 23; Grid_Score: 0.729068
22. Number of features: 25; Grid_Score: 0.725056
23. Number of features: 24; Grid_Score: 0.725044
24. Number of features: 2; Grid_Score: 0.506952
25. Number of features: 1; Grid_Score: 0.272896

In this particular example:

  1. For RFE: the code always returns 12 features (roughly half of the 25 features, as expected from the docs)
  2. For RFECV: the code returns a different number between 1-25 (not half the number of features)

It seems to me that when RFECV is being selected, the number of features is being picked only based on the KFCV scores - i.e. the cross validation scores are over-riding RFE's successive pruning of features.

Is this true? If one would like to use the native recursive feature elimination algorithm, then is RFECV using this algorithm or is it using a hybrid version of it?

In RFECV, is the cross-validation being done on the subset of features remaining after pruning? If so, how many features are kept after each prune in RFECV?

Answer

In the cross-validated version, the features are re-ranked at each step and the lowest-ranked feature is dropped -- this is referred to as "recursive feature elimination" in the docs.

If you want to compare this to the naive version, you'll need to compute the cross-validated score for the features selected by RFE. My guess is that the RFECV answer is correct -- judging from the sharp increase in model performance as features are removed, you probably have some highly correlated features that are harming your model's performance.
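One way to carry out that comparison is to cross-validate a pipeline that runs plain RFE before the classifier, so elimination is repeated inside each fold. A sketch with an arbitrary small dataset:

```python
# Cross-validate naive RFE (half the features, as in the question) wrapped
# in a pipeline, so feature elimination is re-run inside each fold; the
# resulting mean accuracy can be compared against RFECV's grid scores.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
pipe = make_pipeline(RFE(SVC(kernel="linear"), step=1),
                     SVC(kernel="linear"))
scores = cross_val_score(pipe, X, y, cv=2, scoring="accuracy")
print("Naive RFE mean CV accuracy: %.3f" % scores.mean())
```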
