How can we get the names (headers) of the selected and omitted features (columns) using scikit-learn?
Problem description
I will explain the scenario with a piece of data.
Example dataset:
GA_ID PN_ID PC_ID MBP_ID GR_ID AP_ID class
0.033 6.652 6.681 0.194 0.874 3.177 0
0.034 9.039 6.224 0.194 1.137 0 0
0.035 10.936 10.304 1.015 0.911 4.9 1
0.022 10.11 9.603 1.374 0.848 4.566 1
0.035 2.963 17.156 0.599 0.823 9.406 1
0.033 10.872 10.244 1.015 0.574 4.871 1
0.035 21.694 22.389 1.015 0.859 9.259 1
0.035 10.936 10.304 1.015 0.911 4.9 1
0.035 10.936 10.304 1.015 0.911 4.9 1
0.035 10.936 10.304 1.015 0.911 4.9 0
0.036 1.373 12.034 0.35 0.259 5.723 0
0.033 9.831 9.338 0.35 0.919 4.44 0
Feature selection step 1 and its outcome: VarianceThreshold
PN_ID PC_ID MBP_ID GR_ID AP_ID class
6.652 6.681 0.194 0.874 3.177 0
9.039 6.224 0.194 1.137 0 0
10.936 10.304 1.015 0.911 4.9 1
10.11 9.603 1.374 0.848 4.566 1
2.963 17.156 0.599 0.823 9.406 1
10.872 10.244 1.015 0.574 4.871 1
21.694 22.389 1.015 0.859 9.259 1
10.936 10.304 1.015 0.911 4.9 1
10.936 10.304 1.015 0.911 4.9 1
10.936 10.304 1.015 0.911 4.9 0
1.373 12.034 0.35 0.259 5.723 0
9.831 9.338 0.35 0.919 4.44 0
Feature selection step 2 and its outcome: tree-based feature selection (e.g. from sklearn.ensemble import ExtraTreesClassifier)
PN_ID MBP_ID GR_ID AP_ID class
6.652 0.194 0.874 3.177 0
9.039 0.194 1.137 0 0
10.936 1.015 0.911 4.9 1
10.11 1.374 0.848 4.566 1
2.963 0.599 0.823 9.406 1
10.872 1.015 0.574 4.871 1
21.694 1.015 0.859 9.259 1
10.936 1.015 0.911 4.9 1
10.936 1.015 0.911 4.9 1
10.936 1.015 0.911 4.9 0
1.373 0.35 0.259 5.723 0
9.831 0.35 0.919 4.44 0
Here we can conclude that we started with 6 columns (features) and one class label, and by the final step had reduced them to 4 features and one class label. The GA_ID and PC_ID columns were removed, and the model was built using the PN_ID, MBP_ID, GR_ID and AP_ID features.
Unfortunately, when I performed feature selection with the methods available in the scikit-learn library, I found that they return only the reduced data and its shape, without the names of the selected and omitted features.
I have written a lot of clumsy Python code (I am not a very experienced programmer) to find the answer, but without success.
Could you please suggest a way to solve this? Thanks.
(Note: for this post I did not actually run any feature selection method on the example dataset; I deleted the columns at random to illustrate the case.)
Recommended answer
Perhaps this code and its commented explanations will help (adapted from here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # array with the importance of each feature
idx = np.arange(0, X.shape[1])  # index array with one entry per feature
# keep only the features whose importance is greater than the mean importance
features_to_keep = idx[importances > np.mean(importances)]

# should be an array of roughly size 3
print(features_to_keep.shape)

# pull the X columns corresponding to the most important features
x_feature_selected = X[:, features_to_keep]
print(x_feature_selected)
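The code above yields column indices rather than names. To recover the headers themselves, one option is to keep the data in a pandas DataFrame and read the boolean mask returned by each selector's get_support() method. The following is only a sketch under a few assumptions: the use of pandas, the helper split_names, the variance threshold of 0.001, and the use of SelectFromModel with threshold="mean" are illustrative choices, not something taken from the question. Note also that SelectFromModel keeps features whose importance is greater than or equal to the mean, which differs slightly from the strict comparison above. Because the question's columns were deleted at random, the features this sketch selects need not match the ones shown there.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

# Example data from the question, held in a DataFrame so the headers survive.
columns = ["GA_ID", "PN_ID", "PC_ID", "MBP_ID", "GR_ID", "AP_ID", "class"]
rows = [
    [0.033, 6.652, 6.681, 0.194, 0.874, 3.177, 0],
    [0.034, 9.039, 6.224, 0.194, 1.137, 0.0, 0],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9, 1],
    [0.022, 10.11, 9.603, 1.374, 0.848, 4.566, 1],
    [0.035, 2.963, 17.156, 0.599, 0.823, 9.406, 1],
    [0.033, 10.872, 10.244, 1.015, 0.574, 4.871, 1],
    [0.035, 21.694, 22.389, 1.015, 0.859, 9.259, 1],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9, 1],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9, 1],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9, 0],
    [0.036, 1.373, 12.034, 0.35, 0.259, 5.723, 0],
    [0.033, 9.831, 9.338, 0.35, 0.919, 4.44, 0],
]
df = pd.DataFrame(rows, columns=columns)
X = df.drop(columns="class")
y = df["class"]


def split_names(selector, names):
    """Hypothetical helper: split names into (kept, dropped) for a fitted selector."""
    mask = selector.get_support()  # boolean array, True for each kept feature
    kept = [name for name, keep in zip(names, mask) if keep]
    dropped = [name for name, keep in zip(names, mask) if not keep]
    return kept, dropped


# Step 1: drop near-constant columns (0.001 is an assumed threshold).
vt = VarianceThreshold(threshold=0.001)
vt.fit(X)
kept1, dropped1 = split_names(vt, X.columns)
print("step 1 kept:   ", kept1)
print("step 1 dropped:", dropped1)

# Step 2: tree-based selection on the columns that survived step 1.
X1 = X[kept1]
sfm = SelectFromModel(
    ExtraTreesClassifier(n_estimators=250, random_state=0),
    threshold="mean",  # keep features with importance >= mean importance
)
sfm.fit(X1, y)
kept2, dropped2 = split_names(sfm, X1.columns)
print("step 2 kept:   ", kept2)
print("step 2 dropped:", dropped2)

# Reduced data that still carries its column headers.
X_reduced = X1[kept2]
print(X_reduced.head())
```

Indexing the DataFrame by the kept names, instead of calling transform(), is what preserves the headers in the reduced data; the omitted names are simply the complement of the mask.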