使用scikit-learn消除随机森林上的递归特征 [英] Recursive feature elimination on Random Forest using scikit-learn
问题描述
我正在尝试使用scikit-learn
和随机森林分类器执行递归特征消除,并以OOB ROC作为对在递归过程中创建的每个子集进行评分的方法.
I'm trying to preform recursive feature elimination using scikit-learn
and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process.
但是,当我尝试使用RFECV
方法时,出现错误消息AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
However, when I try to use the RFECV
method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
随机森林本身没有系数,但是按基尼分数却具有一定的排名.所以,我想知道如何解决这个问题.
Random Forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get arround this problem.
请注意,我要使用一种方法,该方法将明确告诉我pandas
DataFrame中的哪些功能已在最佳分组中选择,因为我正在使用递归功能选择以尽量减少要输入的数据量最终分类器.
Please note that I want to use a method that will explicitly tell me what features from my pandas
DataFrame were selected in the optimal grouping as I am using recursive feature selection to try to minimize the amount of data I will input into the final classifier.
下面是一些示例代码:
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
ranking_ = rfe.fit(X_train, y_train).ranking_
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
推荐答案
这就是我的想法.这是一个非常简单的解决方案,并且由于我要对高度不平衡的数据集进行分类,因此它依赖于自定义精度指标(称为weightedAccuracy).但是,如果需要的话,应该容易地使其更具可扩展性.
Here's what I ginned up. It's a pretty simple solution, and relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But, it should be easily made more extensible if desired.
from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
def get_enhanced_confusion_matrix(actuals, predictions, labels):
""""enhances confusion_matrix by adding sensivity and specificity metrics"""
cm = confusion_matrix(actuals, predictions, labels = labels)
sensitivity = float(cm[1][1]) / float(cm[1][0]+cm[1][1])
specificity = float(cm[0][0]) / float(cm[0][0]+cm[0][1])
weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
return cm, sensitivity, specificity, weightedAccuracy
iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')
response, _ = pandas.factorize(y)
xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x, response, test_size = .25, random_state = 36583)
print "building the first forest"
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
importances = pandas.DataFrame({'name':x.columns,'imp':rf.feature_importances_
}).sort(['imp'], ascending = False).reset_index(drop = True)
cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)
rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures],
'weightedAccuracy':[weightedAccuracy],
'sensitivity':[sensitivity],
'specificity':[specificity]})
print "running RFE on %d features"%numFeatures
for i in range(1,numFeatures,1):
varsUsed = importances['name'][0:i]
print "now using %d of %s features"%(len(varsUsed), numFeatures)
xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x[varsUsed], response, test_size = .25)
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
print("\n"+str(cm))
print('the sensitivity is %d percent'%(sensitivity * 100))
print('the specificity is %d percent'%(specificity * 100))
print('the weighted accuracy is %d percent'%(weightedAccuracy * 100))
rfeMatrix = rfeMatrix.append(
pandas.DataFrame({'numFeatures':[len(varsUsed)],
'weightedAccuracy':[weightedAccuracy],
'sensitivity':[sensitivity],
'specificity':[specificity]}), ignore_index = True)
print("\n"+str(rfeMatrix))
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()
print "the final features used are %s"%featuresUsed
这篇关于使用scikit-learn消除随机森林上的递归特征的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!