XGBoost: How to get class probabilities from xgb.dump (multi:softprob objective)

Question

I've got a 3-class classification model trained with XGBoost. The next step is to take the tree model (printed by xgb.dump()) and use it in a .NET production system. I really do not understand how to get a 3-dimensional vector of class probabilities from the single value in each leaf:

<code>
booster[148]
0:[f24<1.5] yes=1,no=2,missing=1
1:[f4<0.085] yes=3,no=4,missing=3
3:leaf=0.00624765
4:leaf=-0.0208106
2:[f4<0.115] yes=5,no=6,missing=5
5:leaf=0.14725
6:leaf=0.0102657
</code>

P.S. Calling a Python function from .NET is not a good idea due to speed limitations.

Answer

This took a while to figure out. Once you get your tree dump, the steps to follow are:

  1. Figure out the leaf values for each booster. The boosters cycle through the classes: the first booster belongs to class 0, the next to class 1, the next to class 2, then class 0 again, and so on. So if you have num_round = 10 and 3 classes, you will see 30 boosters.
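
A minimal sketch of that grouping, assuming a trained xgboost Booster held in a variable named bst and num_class = 3 (both names are assumptions, not from the original post):

<code>
# Sketch: group dumped boosters by class for a multi:softprob model.
# Booster i belongs to class i % num_class.
num_class = 3
trees = bst.get_dump()  # one text dump per booster, same content as xgb.dump()
trees_per_class = {c: [] for c in range(num_class)}
for i, tree in enumerate(trees):
    trees_per_class[i % num_class].append(tree)
# With num_round = 10 you get len(trees) == 30, i.e. 10 trees per class.
</code>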

Be careful about "missing". If you have not explicitly set a missing value on the DMatrix, xgb can treat the value 0 as missing. So when you walk down the tree, you may need to jump to the node x denoted by missing=x whenever the feature value at that node is 0. One way around this confusion is to make sure you set a missing value on the DMatrix for both training and prediction. I used a value that cannot occur in my data, and I also handled NA-type values by replacing them with some (non-zero) value before training or predicting. Of course, 0 may genuinely mean missing in your case, and then that is fine. You may actually notice this issue with categorical features that are 1 or 0 in your data, where a tree node ends up with an absurd condition such as a split against a very small negative number.
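
To make the walk concrete, here is a hypothetical sketch of evaluating one dumped tree for a single instance, taking the missing=x branch when a feature is absent. It assumes the instance is a dict mapping feature index to value, where an absent key means the feature is missing; none of these names come from xgboost itself:

<code>
import re

# The two line shapes in a text dump:
#   "0:[f24<1.5] yes=1,no=2,missing=1"   and   "3:leaf=0.00624765"
SPLIT_RE = re.compile(r"(\d+):\[f(\d+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)")
LEAF_RE = re.compile(r"(\d+):leaf=(-?[\d.eE+-]+)")

def eval_tree(dump_lines, x):
    """Return the leaf value this tree yields for instance x
    (a dict {feature_index: value}; an absent key means missing)."""
    nodes = {}
    for line in dump_lines:
        line = line.strip()
        m = SPLIT_RE.match(line)
        if m:
            nid, feat, thr, yes, no, miss = m.groups()
            nodes[int(nid)] = ("split", int(feat), float(thr),
                               int(yes), int(no), int(miss))
            continue
        m = LEAF_RE.match(line)
        if m:
            nodes[int(m.group(1))] = ("leaf", float(m.group(2)))
    nid = 0  # start at the root
    while True:
        node = nodes[nid]
        if node[0] == "leaf":
            return node[1]
        _, feat, thr, yes, no, miss = node
        if feat not in x:        # missing feature: take the missing branch
            nid = miss
        elif x[feat] < thr:      # condition holds: take the "yes" branch
            nid = yes
        else:                    # condition fails: take the "no" branch
            nid = no
</code>

You would call it per booster, e.g. eval_tree(tree_text.splitlines(), x), and sum the results per class as described above.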

Say you have 3 rounds. Then you will end up with leaf values l1_0, l2_0, l3_0 for class 0, l1_1, l2_1, l3_1 for class 1, and l1_2, l2_2, l3_2 for class 2.

Now, a good way to make sure you have the logic right is to turn on output_margin and pred_leaf, one at a time. With pred_leaf on, you get a matrix showing exactly which leaf each booster hit, per class, for every instance. With output_margin on, you get the per-class sum of leaf values that xgb is computing.
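
A sketch of those two checks, reusing the hypothetical bst from above and assuming your test features are in X_test:

<code>
import xgboost as xgb

dtest = xgb.DMatrix(X_test)  # X_test is assumed to be your feature matrix
# One row per instance, one column per booster: the leaf id each booster hit.
leaf_ids = bst.predict(dtest, pred_leaf=True)
# One row per instance, one column per class: the raw pre-softmax margins.
margins = bst.predict(dtest, output_margin=True)
</code>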

For class 0 this sum is 0.5 + l1_0 + l2_0 + l3_0, and so on for the other classes. You can cross-check it against the response of predict with output_margin on. Here 0.5 is the bias.
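
For example, reconstructing the class-0 margin of the first instance by hand, where l1_0, l2_0, l3_0 are the leaf values you traced yourself and margins comes from the previous sketch (all names are assumptions):

<code>
bias = 0.5  # xgboost's default base_score
v0 = bias + l1_0 + l2_0 + l3_0
# Should agree with the class-0 margin xgboost reports for this instance:
assert abs(v0 - margins[0][0]) < 1e-5
</code>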

Now say you got v0, v1 and v2 as the bias-plus-leaf-value sums. Then your probability for class 0 is

    p(class0) = exp(v0)/(exp(v0)+exp(v1)+exp(v2))
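
The same softmax in code, as a minimal sketch (subtracting the max is only for numerical stability and does not change the result):

<code>
import math

def softmax(vs):
    m = max(vs)  # shift for numerical stability
    exps = [math.exp(v - m) for v in vs]
    s = sum(exps)
    return [e / s for e in exps]

# probs[0] == p(class0), probs[1] == p(class1), probs[2] == p(class2)
probs = softmax([v0, v1, v2])
</code>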
