AUC-based Feature Importance using Random Forest


Problem Description

I'm trying to predict a binary variable with both random forests and logistic regression. My classes are heavily unbalanced (approximately 1.5% of Y = 1).

The default feature importance techniques in random forests are based on classification accuracy (error rate), which has been shown to be a poor measure for unbalanced classes (see here and here): with 1.5% positives, a classifier that always predicts Y = 0 already achieves 98.5% accuracy.
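To make that concrete, here is a minimal sketch on purely synthetic data (all names and numbers are illustrative, not from the question): a majority-class baseline looks excellent on accuracy but shows no skill at all on AUC.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical target with ~1.5% positives, mirroring the question's setting
rng = np.random.RandomState(0)
y = (rng.rand(10000) < 0.015).astype(int)
X = rng.randn(10000, 5)  # uninformative features

# Always predicting the majority class (Y = 0)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
print(accuracy_score(y, dummy.predict(X)))             # ~0.985 -- looks great
print(roc_auc_score(y, dummy.predict_proba(X)[:, 1]))  # 0.5 -- no skill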

The two standard VIMs (variable importance measures) for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking, the Gini VIM of a predictor of interest is the sum, over all trees in the forest, of the decreases in Gini impurity generated by that predictor whenever it was selected for splitting, scaled by the number of trees.
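For reference, scikit-learn exposes the Gini VIM directly as the fitted forest's feature_importances_ attribute (the impurity-based importance, averaged over the trees). A minimal sketch on synthetic data, illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, heavily imbalanced data (~1.5% positives)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.985], random_state=0)
rf = RandomForestClassifier(n_estimators=100, criterion='gini',
                            random_state=0).fit(X, y)
# Mean decrease in Gini impurity per feature, normalized to sum to 1
print(rf.feature_importances_)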

My question is: is that kind of method implemented in scikit-learn (as it is in the R package party)? Or is there a workaround?

PS: This question is related to another one.

Solution

After doing some research, this is what I came up with:

import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Feature names (the first column of db_train is assumed to be the target)
names = db_train.iloc[:, 1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                  class_weight="balanced",  # "auto" is deprecated
                                  criterion='gini',
                                  bootstrap=True,
                                  max_features=10,
                                  min_samples_split=2,  # must be >= 2 in current scikit-learn
                                  min_samples_leaf=6,
                                  max_depth=3,
                                  n_jobs=-1)
scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)

# Baseline AUC on the untouched test set (rf.predict_proba(X_test)[:, 1]
# would be the more usual input to roc_auc_score than hard predictions)
base_auc = roc_auc_score(Y_test, rf.predict(X_test))

# Permute one feature at a time and record the relative AUC drop
# (X_test is assumed to be a NumPy array here)
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])
    perm_auc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((base_auc - perm_auc) / base_auc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))

Features sorted by their score:
[(0.0029, 'Var1'), (0.0027, 'Var2'), (0.0024, 'Var3'), (0.0022, 'Var4'), (0.0022, 'Var5'), (0.0022, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]

The output is not very pretty, but you get the idea. The weakness of this approach is that the feature importances appear to be highly parameter-dependent: I ran it with different parameters (max_depth, max_features, ...) and got quite different results. So I decided to run a grid search over the parameters (scoring='roc_auc') and then apply this VIM (variable importance measure) to the best model.
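As a side note on the original question: newer scikit-learn versions (0.22+) ship sklearn.inspection.permutation_importance, which accepts scoring='roc_auc' and so provides this AUC-based permutation VIM out of the box. A minimal sketch of the grid-search-then-VIM plan, reusing X_train, Y_train, X_test, Y_test, and names from above (the grid values are illustrative, not the ones actually tuned here):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance

# -- Grid search the forest with AUC as the model-selection criterion
param_grid = {'max_depth': [3, 5, 10],
              'max_features': [5, 10, 'sqrt']}
grid = GridSearchCV(RandomForestClassifier(n_estimators=500,
                                           class_weight='balanced',
                                           n_jobs=-1),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, Y_train)

# -- Built-in AUC-based permutation VIM, applied to the best model
result = permutation_importance(grid.best_estimator_, X_test, Y_test,
                                scoring='roc_auc', n_repeats=10,
                                random_state=0)
print(sorted(zip(result.importances_mean.round(4), names), reverse=True))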

I took this approach from this (great) notebook.

All suggestions/comments are most welcome!
