How can we get to know the selected and omitted feature (column) names (headers) using scikit-learn


Problem Description

I am explaining the scenario with a piece of data:

Example data set:

GA_ID   PN_ID   PC_ID   MBP_ID  GR_ID   AP_ID   class
0.033   6.652   6.681   0.194   0.874   3.177     0
0.034   9.039   6.224   0.194   1.137   0         0
0.035   10.936  10.304  1.015   0.911   4.9       1
0.022   10.11   9.603   1.374   0.848   4.566     1
0.035   2.963   17.156  0.599   0.823   9.406     1
0.033   10.872  10.244  1.015   0.574   4.871     1
0.035   21.694  22.389  1.015   0.859   9.259     1
0.035   10.936  10.304  1.015   0.911   4.9       1
0.035   10.936  10.304  1.015   0.911   4.9       1
0.035   10.936  10.304  1.015   0.911   4.9       0
0.036   1.373   12.034  0.35    0.259   5.723     0
0.033   9.831   9.338   0.35    0.919   4.44      0

Feature selection step 1 and its outcome: VarianceThreshold

     PN_ID  PC_ID   MBP_ID  GR_ID   AP_ID   class
    6.652   6.681   0.194   0.874   3.177     0
    9.039   6.224   0.194   1.137   0         0
    10.936  10.304  1.015   0.911   4.9       1
    10.11   9.603   1.374   0.848   4.566     1
    2.963   17.156  0.599   0.823   9.406     1
    10.872  10.244  1.015   0.574   4.871     1
    21.694  22.389  1.015   0.859   9.259     1
    10.936  10.304  1.015   0.911   4.9       1
    10.936  10.304  1.015   0.911   4.9       1
    10.936  10.304  1.015   0.911   4.9       0
    1.373   12.034  0.35    0.259   5.723     0
    9.831   9.338   0.35    0.919   4.44      0
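Since the question is about recovering column names, here is a minimal sketch (using a few rows of the example table and a hypothetical threshold of 0.001) of how VarianceThreshold's get_support() mask maps the reduced data back to the original headers:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column names and a few rows from the example data set (class label excluded)
columns = ["GA_ID", "PN_ID", "PC_ID", "MBP_ID", "GR_ID", "AP_ID"]
X = np.array([
    [0.033,  6.652,  6.681, 0.194, 0.874, 3.177],
    [0.034,  9.039,  6.224, 0.194, 1.137, 0.0],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9],
    [0.022, 10.11,   9.603, 1.374, 0.848, 4.566],
])

selector = VarianceThreshold(threshold=0.001)  # hypothetical threshold
X_reduced = selector.fit_transform(X)

mask = selector.get_support()  # one boolean per original column
kept = [name for name, keep in zip(columns, mask) if keep]
dropped = [name for name, keep in zip(columns, mask) if not keep]
print("kept:", kept)      # GA_ID has near-zero variance, so it is dropped
print("dropped:", dropped)
```

On this sample the low-variance GA_ID column is removed, matching the step 1 table above.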

Feature selection step 2 and its outcome: tree-based feature selection (e.g. from sklearn.ensemble import ExtraTreesClassifier)

PN_ID   MBP_ID  GR_ID   AP_ID   class
6.652   0.194   0.874   3.177     0
9.039   0.194   1.137   0         0
10.936  1.015   0.911   4.9       1
10.11   1.374   0.848   4.566     1
2.963   0.599   0.823   9.406     1
10.872  1.015   0.574   4.871     1
21.694  1.015   0.859   9.259     1
10.936  1.015   0.911   4.9       1
10.936  1.015   0.911   4.9       1
10.936  1.015   0.911   4.9       0
1.373   0.35    0.259   5.723     0
9.831   0.35    0.919   4.44      0
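For the tree-based step, SelectFromModel wraps an estimator and exposes the same get_support() mask. The sketch below uses made-up random data rather than the table above, with a class label constructed so that it depends on the PN_ID and MBP_ID columns:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

columns = np.array(["GA_ID", "PN_ID", "PC_ID", "MBP_ID", "GR_ID", "AP_ID"])

rng = np.random.RandomState(0)
X = rng.rand(100, 6)  # stand-in data; real code would use the table above
y = (X[:, 1] + X[:, 3] > 1.0).astype(int)  # class depends on PN_ID and MBP_ID

# SelectFromModel keeps features whose importance exceeds the (default) mean
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)

mask = selector.get_support()
print("selected:", list(columns[mask]))
print("omitted:", list(columns[~mask]))
```

The informative columns PN_ID and MBP_ID come out with the largest importances and are selected.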

Here we can conclude that we started with 6 columns (features) and one class label, and at the final step reduced it to 4 features and one class label. The GA_ID and PC_ID columns have been removed, while the model has been constructed using the PN_ID, MBP_ID, GR_ID, and AP_ID features.

But unfortunately, when I performed feature selection with the methods available in the scikit-learn library, I found that they return only the shape of the data and the reduced data, without the names of the selected and omitted features.

I have written a lot of naive Python code (as I am not a very experienced programmer) to find the answer, but without success.

Kindly suggest some way to get this. Thanks.

(Note: for this post I did not actually run any feature selection method on the given example data set; rather, I deleted the columns randomly to illustrate the case.)

Recommended Answer

Perhaps this code and its commented explanations will help (adapted from here).

import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # array with the importance of each feature

idx = np.arange(0, X.shape[1])  # index array covering all features

# Keep only features whose importance is greater than the mean importance;
# this should yield roughly 3 features here
features_to_keep = idx[importances > np.mean(importances)]
print(features_to_keep.shape)

x_feature_selected = X[:, features_to_keep]  # X values of the most important features
print(x_feature_selected)
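To get back to the original question about names, the features_to_keep index array can be mapped onto the headers of the real data set. The importances below are made-up values, chosen only so the outcome matches the question's example:

```python
import numpy as np

# Headers from the question's data set and hypothetical feature importances
columns = np.array(["GA_ID", "PN_ID", "PC_ID", "MBP_ID", "GR_ID", "AP_ID"])
importances = np.array([0.02, 0.30, 0.05, 0.25, 0.18, 0.20])  # made-up values

# Same rule as above: keep features whose importance exceeds the mean
features_to_keep = np.arange(len(columns))[importances > np.mean(importances)]

selected = columns[features_to_keep]          # names of the kept features
omitted = np.setdiff1d(columns, selected)     # names of the dropped features
print("selected:", list(selected))
print("omitted:", list(omitted))
```

With these values the selected names are PN_ID, MBP_ID, GR_ID, and AP_ID, and the omitted ones are GA_ID and PC_ID, exactly the split described in the question.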

