Understanding Spark RandomForest featureImportances results


Problem description

I'm using RandomForest.featureImportances but I don't understand the output result.

I have 12 features, and this is the output I get.

I get that this might not be an apache-spark-specific question, but I cannot find anything that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11],
//  [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206])

Answer

Given a tree ensemble model, RandomForest.featureImportances computes the importance of each feature.

This generalizes the idea of "Gini" importance to other losses, following the explanation of Gini importance from "Random Forests" documentation by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
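
To make the "gain" concrete, here is a small self-contained sketch (plain Scala with made-up node counts, not Spark's actual internals) of the Gini impurity decrease at a single binary split, scaled by the number of instances reaching the node:

```scala
// Gini impurity of a node from its per-class instance counts
def gini(counts: Seq[Int]): Double = {
  val n = counts.sum.toDouble
  1.0 - counts.map(c => (c / n) * (c / n)).sum
}

// Hypothetical split: a parent node with 40/40 class counts
// split into children with 30/10 and 10/30
val parent = gini(Seq(40, 40)) // 0.5
val left   = gini(Seq(30, 10)) // 0.375
val right  = gini(Seq(10, 30)) // 0.375

val nParent = 80.0
val nLeft   = 40.0
val nRight  = 40.0

// Gain = impurity decrease, weighted by the instances passing through each node
val gain = nParent * parent - (nLeft * left + nRight * right)
```

Sums of such gains, accumulated per feature over all the nodes that split on that feature, are what gets normalized and averaged below.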

For collections of trees, which include boosting and bagging, Hastie et al. suggest using the average of single-tree importances across all trees in the ensemble.

This feature importance is calculated as follows:

• Average over trees:
  • importance(feature j) = sum (over nodes which split on feature j) of the gain, where the gain is scaled by the number of instances passing through the node
  • Normalize each tree's importances to sum to 1.
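
The steps above can be sketched in plain Scala (a toy two-tree ensemble with hypothetical per-feature gain sums, not Spark's implementation):

```scala
// Raw gain sums per feature for each tree (3 features, made-up values)
val perTreeGains = Seq(
  Array(4.0, 1.0, 5.0), // tree 1
  Array(2.0, 2.0, 6.0)  // tree 2
)

// Step 1: normalize each tree's importances so they sum to 1
val perTreeNormalized = perTreeGains.map { gains =>
  val total = gains.sum
  gains.map(_ / total)
}

// Step 2: average over trees, feature by feature
val nFeatures = 3
val averaged = (0 until nFeatures).map { j =>
  perTreeNormalized.map(_(j)).sum / perTreeNormalized.size
}
// averaged: Vector(0.3, 0.15, 0.55) — already sums to 1
```

Because each per-tree vector sums to 1, the averaged vector does too, which is why the importances you get back behave like a probability distribution over features.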

References: Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001. Section 15.3.2 (Variable Importance), p. 593.

Let's go back to your importance vector:

val importanceVector = Vectors.sparse(12,
  Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  Array(0.1956128039688559, 0.06863606797951556, 0.11302128590305296,
    0.091986700351889, 0.03430651625283274, 0.05975817050022879,
    0.06929766152519388, 0.052654922125615934, 0.06437052114945474,
    0.1601713590349946, 0.0324327322375338, 0.057751258970832206))


First, let's sort these features by importance:

importanceVector.toArray.zipWithIndex
  .map(_.swap)
  .sortBy(-_._2)
  .foreach(x => println(x._1 + " -> " + x._2))
// 0 -> 0.1956128039688559
// 9 -> 0.1601713590349946
// 2 -> 0.11302128590305296
// 3 -> 0.091986700351889
// 6 -> 0.06929766152519388
// 1 -> 0.06863606797951556
// 8 -> 0.06437052114945474
// 5 -> 0.05975817050022879
// 11 -> 0.057751258970832206
// 7 -> 0.052654922125615934
// 4 -> 0.03430651625283274
// 10 -> 0.0324327322375338
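
As a quick sanity check: since the per-tree importances are normalized and then averaged, the returned vector should behave like a distribution and sum to 1. In plain Scala, using the values from the vector above:

```scala
// The twelve importance values from the question's vector
val importances = Array(
  0.1956128039688559, 0.06863606797951556, 0.11302128590305296,
  0.091986700351889, 0.03430651625283274, 0.05975817050022879,
  0.06929766152519388, 0.052654922125615934, 0.06437052114945474,
  0.1601713590349946, 0.0324327322375338, 0.057751258970832206)

// Sums to 1.0 up to floating-point rounding
val total = importances.sum
```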
      

So what does that mean?

It means that your first feature (index 0) is the most important one, with a weight of ~0.19, and your 11th feature (index 10) is the least important in your model.

