Spark MLlib Decision Trees: Probability of labels by features?
Question
I managed to display the total probabilities of my labels. For example, after displaying my decision tree, I have this table:
Total Predictions :
65% impressions
30% clicks
5% conversions
My problem is finding the probabilities (or sample counts) by features, that is, at each node. For example:
if feature1 > 5
    if feature2 < 10
        Predict Impressions
        samples : 30 Impressions
    else feature2 >= 10
        Predict Clicks
        samples : 5 Clicks
Scikit-learn does this automatically; I am trying to find a way to do it with Spark.
Recommended answer
Note: the following solution is for Scala only. I did not find a way to do this in Python.
Assuming you just want a visual representation of the tree as in your example, one option is to adapt the subtreeToString method from the Node.scala code on Spark's GitHub so that it includes the probability at each node split, as in the following snippet:
import org.apache.spark.mllib.tree.configuration.FeatureType.{Categorical, Continuous}
import org.apache.spark.mllib.tree.model.{Node, Split}

// Adapted from Node.subtreeToString: also prints predict.prob at each split.
def subtreeToString(rootNode: Node, indentFactor: Int = 0): String = {
  // Describe the condition for the left or the right branch of a split.
  def splitToString(split: Split, left: Boolean): String = {
    split.featureType match {
      case Continuous => if (left) {
        s"(feature ${split.feature} <= ${split.threshold})"
      } else {
        s"(feature ${split.feature} > ${split.threshold})"
      }
      case Categorical => if (left) {
        s"(feature ${split.feature} in ${split.categories.mkString("{", ",", "}")})"
      } else {
        s"(feature ${split.feature} not in ${split.categories.mkString("{", ",", "}")})"
      }
    }
  }
  val prefix: String = " " * indentFactor
  if (rootNode.isLeaf) {
    prefix + s"Predict: ${rootNode.predict.predict} \n"
  } else {
    // predict.prob is the probability of this node's predicted label, as a percentage here.
    val prob = rootNode.predict.prob*100D
    prefix + s"If ${splitToString(rootNode.split.get, left = true)} " + f"(Prob: $prob%04.2f %%)" + "\n" +
      subtreeToString(rootNode.leftNode.get, indentFactor + 1) +
      prefix + s"Else ${splitToString(rootNode.split.get, left = false)} " + f"(Prob: ${100-prob}%04.2f %%)" + "\n" +
      subtreeToString(rootNode.rightNode.get, indentFactor + 1)
  }
}
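The question also asks for sample counts per node, which the snippet above does not print. One possible way to get them, sketched here and not part of the original answer, is to route each training point down the tree yourself and count the labels that land in each leaf. The names LeafCounts, leafId and countsByLeaf are hypothetical helpers, not Spark API, and the sketch assumes the RDD-based org.apache.spark.mllib tree model, whose Node exposes id, isLeaf, split, leftNode and rightNode.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.configuration.FeatureType.Continuous
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names exist in Spark itself.
object LeafCounts {
  // Walk from `node` down to the leaf that `features` falls into and return that leaf's id.
  // Left means "value <= threshold" (continuous) or "value in categories" (categorical).
  @annotation.tailrec
  def leafId(node: Node, features: Vector): Int =
    if (node.isLeaf) {
      node.id
    } else {
      val split = node.split.get
      val value = features(split.feature)
      val goLeft =
        if (split.featureType == Continuous) value <= split.threshold
        else split.categories.contains(value)
      leafId(if (goLeft) node.leftNode.get else node.rightNode.get, features)
    }

  // Count samples per (leaf id, label) pair over a labeled data set.
  def countsByLeaf(
      model: DecisionTreeModel,
      data: RDD[LabeledPoint]): scala.collection.Map[(Int, Double), Long] = {
    val root = model.topNode
    data.map(p => (leafId(root, p.features), p.label)).countByValue()
  }
}
```

Calling LeafCounts.countsByLeaf(model, trainingData) returns a map such as (leaf id, label) -> count, which can then be joined with the printed tree by node id. Dividing each count by the total for its leaf gives per-leaf label proportions.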
The same recursive traversal used in subtreeToString could also build a tree data structure from this information. The main difference would be to store the values that are printed (split.feature, split.threshold, predict.prob, and so on) as vals and use them to build the structure instead of a string.
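A minimal sketch of that idea follows, assuming the RDD-based org.apache.spark.mllib tree model. TreeInfoSketch, TreeInfo, LeafInfo, SplitInfo and toTreeInfo are hypothetical names introduced for illustration; they are not part of Spark.

```scala
import org.apache.spark.mllib.tree.configuration.FeatureType.Continuous
import org.apache.spark.mllib.tree.model.Node

// Hypothetical container types; only Node and its fields come from Spark.
object TreeInfoSketch {
  sealed trait TreeInfo

  // A leaf keeps the predicted label and the probability of that label.
  case class LeafInfo(prediction: Double, prob: Double) extends TreeInfo

  // An internal node keeps the split description plus both subtrees.
  case class SplitInfo(
      feature: Int,
      isContinuous: Boolean,
      threshold: Double,        // meaningful for continuous features
      categories: List[Double], // meaningful for categorical features
      prediction: Double,
      prob: Double,
      left: TreeInfo,
      right: TreeInfo) extends TreeInfo

  // Same recursion as subtreeToString, but returning values instead of a string.
  def toTreeInfo(node: Node): TreeInfo =
    if (node.isLeaf) {
      LeafInfo(node.predict.predict, node.predict.prob)
    } else {
      val split = node.split.get
      SplitInfo(
        split.feature,
        split.featureType == Continuous,
        split.threshold,
        split.categories,
        node.predict.predict,
        node.predict.prob,
        toTreeInfo(node.leftNode.get),
        toTreeInfo(node.rightNode.get))
    }
}
```

Given a trained DecisionTreeModel named model, TreeInfoSketch.toTreeInfo(model.topNode) would return the whole tree as nested case classes, which can then be pattern matched, serialized, or rendered in any format.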