Spark MLlib Decision Trees: Probability of labels by features?

Question

I managed to display the overall probabilities of my labels. For example, after displaying my decision tree, I have this table:

Total Predictions :
    65% impressions
    30% clicks
    5%  conversions

My problem is finding the probabilities (or the sample counts) by feature, that is, per node. For example:

if feature1 > 5
   if feature2 < 10
      Predict Impressions
      samples : 30 Impressions
   else feature2 >= 10
      Predict Clicks
      samples : 5 Clicks

scikit-learn does this automatically; I am trying to find a way to do it with Spark.

Recommended answer

Note: the following solution is for Scala only. I did not find a way to do this in Python.

Assuming you just want a visual representation of the tree as in your example, one option is to adapt the subtreeToString method from Node.scala (mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala in Spark's GitHub repository) so that it includes the probability at each node split, as in the following snippet:

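import org.apache.spark.mllib.tree.configuration.FeatureType.{Categorical, Continuous}
import org.apache.spark.mllib.tree.model.{Node, Split}

// Recursively renders the subtree rooted at rootNode, one line per branch,
// indenting each level by one extra space.
def subtreeToString(rootNode: Node, indentFactor: Int = 0): String = {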
  def splitToString(split: Split, left: Boolean): String = {
    split.featureType match {
      case Continuous => if (left) {
        s"(feature ${split.feature} <= ${split.threshold})"
      } else {
        s"(feature ${split.feature} > ${split.threshold})"
      }
      case Categorical => if (left) {
        s"(feature ${split.feature} in ${split.categories.mkString("{", ",", "}")})"
      } else {
        s"(feature ${split.feature} not in ${split.categories.mkString("{", ",", "}")})"
      }
    }
  }
  val prefix: String = " " * indentFactor
  if (rootNode.isLeaf) {
    prefix + s"Predict: ${rootNode.predict.predict} \n"
  } else {
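    // predict.prob is the probability of this node's predicted label
    // (classification only), converted here to a percentage.
    val prob = rootNode.predict.prob*100D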
    prefix + s"If ${splitToString(rootNode.split.get, left = true)} " + f"(Prob: $prob%04.2f %%)" + "\n" +
      subtreeToString(rootNode.leftNode.get, indentFactor + 1) +
      prefix + s"Else ${splitToString(rootNode.split.get, left = false)} " + f"(Prob: ${100-prob}%04.2f %%)" + "\n" +
      subtreeToString(rootNode.rightNode.get, indentFactor + 1)
  }
}

A similar approach could be used to build a tree structure that carries this information. The main difference is that, instead of printing the values (split.feature, split.threshold, predict.prob, and so on), you store them as vals and use them to build the structure.
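
As a rough sketch of that idea, the same recursion can return values instead of strings. The names TreeInfoSketch, TreeInfo, LeafInfo, SplitInfo and toTreeInfo below are hypothetical (they are not part of Spark). Only the fields read from Node and Split come from the MLlib classes already used above. Note that, as far as I understand, predict.prob is the probability of the node's predicted label and is only meaningful for classification trees.

```scala
import org.apache.spark.mllib.tree.configuration.FeatureType
import org.apache.spark.mllib.tree.model.Node

object TreeInfoSketch {
  // Hypothetical containers (not Spark classes) for the values that
  // subtreeToString would otherwise print.
  sealed trait TreeInfo
  case class LeafInfo(prediction: Double, prob: Double) extends TreeInfo
  case class SplitInfo(
      feature: Int,
      featureType: FeatureType.Value,
      threshold: Double,         // used by Continuous splits
      categories: List[Double],  // used by Categorical splits
      prediction: Double,        // label predicted at this internal node
      prob: Double,              // probability of that label at this node
      left: TreeInfo,
      right: TreeInfo) extends TreeInfo

  // Walks the MLlib tree recursively and copies each node's values
  // into the containers above.
  def toTreeInfo(node: Node): TreeInfo = {
    if (node.isLeaf) {
      LeafInfo(node.predict.predict, node.predict.prob)
    } else {
      val split = node.split.get
      SplitInfo(
        split.feature,
        split.featureType,
        split.threshold,
        split.categories,
        node.predict.predict,
        node.predict.prob,
        toTreeInfo(node.leftNode.get),
        toTreeInfo(node.rightNode.get))
    }
  }
}
```

Given a trained DecisionTreeModel named model, calling TreeInfoSketch.toTreeInfo(model.topNode) would return the root of this structure, which you can then traverse or pattern match on instead of parsing printed text.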
