How does sklearn random forest index feature_importances_


Question

I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How can I return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their indices (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features.

import numpy as np

important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):     # keep features above the mean importance
        important_features.append(str(x))
print(important_features)

Additionally, in an effort to understand the indexing, I was able to find out what the important feature '12' actually was (it was variable x14). When I moved variable x14 into the 0 index position of the training dataset and ran the code again, it should have told me that feature '0' is important, but it did not. It is as if it can no longer see that feature, and the first feature listed is actually the one that was listed second the first time I ran the code (feature '22').

I'm thinking that perhaps feature_importances_ is actually using the first column (where I have placed x14) as a sort of ID for the rest of the training dataset, and is thus ignoring it when selecting important features. Can anyone shed some light on these two questions? Thank you in advance for any assistance.

EDIT
Here is how I stored the feature names:

import csv
import numpy as np

tgmc_reader = csv.reader(csvfile)   # csvfile is an already-opened file object
row = next(tgmc_reader)             # header row contains the feature names
feature_names = np.array(row)


Then I loaded the dataset and the target classes:

tgmc_x, tgmc_y = [], []
for row in tgmc_reader:
    tgmc_x.append(row[3:])    # predictors start at the 4th column; columns 2 and 3 are ID variables
    tgmc_y.append(row[0])     # the target is the first column in the dataset


Then I split the dataset into training and testing portions:

from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older releases

x_train, x_test, y_train, y_test = train_test_split(tgmc_x, tgmc_y, test_size=.10, random_state=33)


Then I fit the model:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1, criterion='entropy', max_features=2, max_depth=5, bootstrap=True, oob_score=True, n_jobs=2, random_state=33)
rf = rf.fit(x_train, y_train)


Then I returned the important features:

important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):
        important_features.append(x)


Then I incorporated your suggestion, which worked (thank you very much!):

important_names = feature_names[important_features > np.mean(important_features)]
print(important_names)


And it did indeed return variable names.

['x9' 'x10' 'x11' 'x12' 'x13' 'x15' 'x16']


So you have solved one part of my question for sure, which is awesome. But when I go back to printing the results of my important features

print(important_features)


It returns the following output:

[12, 22, 51, 67, 73, 75, 87, 91, 92, 106, 125, 150, 199, 206, 255, 256, 275, 309, 314, 317]


I am interpreting this to mean that it considers the 12th, 22nd, 51st, etc. variables to be the important ones. So this would be the 12th variable counting from the point where I told it to start indexing the observations at the beginning of my code:

tgmc_x.append(row[3:])
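
A small editorial check of that mapping (a sketch, assuming feature_names holds the full header row stored earlier): because the predictors came from row[3:], model feature index i lines up with header entry i + 3.

predictor_names = feature_names[3:]   # drop the target and the two ID columns
print(predictor_names[12])            # name of important feature '12' (the 13th predictor)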


Is this interpretation correct? If it is, when I move the 12th variable to the 4th column of the original dataset (where I told it to start reading the predictor values in the code just referenced) and run the code again, I get the following output:

[22, 51, 66, 73, 75, 76, 106, 112, 125, 142, 150, 187, 191, 199, 250, 259, 309, 317]


This seems like it no longer recognizes that variable. Additionally, when I move the same variable to the 5th column in the original dataset, the output looks like this:

[1, 22, 51, 66, 73, 75, 76, 106, 112, 125, 142, 150, 187, 191, 199, 250, 259, 309, 317]


This looks like it is recognizing it again. One last thing: after I got it to return the variable names via your suggestion, it gave me a list of 7 variables. When I return the important variables using my original code, it gives me a longer list. Why is this? Thank you again for all of your help. I really appreciate it!

Answer

feature_importances_ returns an array where each index corresponds to the estimated importance of the feature at that position in the training set. There is no sorting done internally; it is a one-to-one correspondence with the features given to the model during training.
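
To see this concretely, here is a minimal synthetic sketch (toy data, not the asker's dataset): the importances array has one entry per input column, in the same order the columns were given to the model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data, for illustration only: 5 columns, of which only column 2 is informative.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 2] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(len(clf.feature_importances_))        # 5 -- one entry per input column
print(np.argmax(clf.feature_importances_))  # 2 -- the informative column's index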

If you stored your feature names as a numpy array and made sure they are consistent with the features passed to the model, you can take advantage of numpy indexing to do it:

importances = rf.feature_importances_
important_names = feature_names[importances > np.mean(importances)]
print(important_names)
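
As a follow-up sketch (assuming, as above, that feature_names holds the full header row while the model was trained on row[3:]): slice the names so they line up with the model's columns, then np.argsort gives a ranked view. Note that the shorter list of 7 names seen earlier came from comparing important_features (a list of indices) to its own mean rather than comparing the importances themselves; the form below avoids that.

import numpy as np

# Drop the target and the two ID columns so names line up with x_train's columns.
predictor_names = feature_names[3:]

importances = rf.feature_importances_
order = np.argsort(importances)[::-1]   # feature indices, most important first
for idx in order[:10]:                  # top 10, for example
    print(predictor_names[idx], importances[idx])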
