How is xgboost quality calculated?
Question
Could someone explain how the Quality column returned by the xgb.model.dt.tree function in the xgboost R package is calculated?
The documentation says that Quality "is the gain related to the split in this specific node".
When you run the following code, given in the xgboost documentation for this function, Quality for node 0 of tree 0 is 4000.53, yet I calculate the Gain as 2002.848:
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
X = train$data
y = train$label
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
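# Initial prediction is 0.5 for every row
p = rep(0.5,nrow(X))
# Rows going left and right for the root split on 'odor=none'
L = which(X[,'odor=none']==0)
R = which(X[,'odor=none']==1)
pL = p[L]
pR = p[R]
yL = y[L]
yR = y[R]
# Sums of gradients (p - y) for the left child, right child and parent
GL = sum(pL-yL)
GR = sum(pR-yR)
G = sum(p-y)
# Sums of hessians p(1 - p) for the left child, right child and parent
HL = sum(pL*(1-pL))
HR = sum(pR*(1-pR))
H = sum(p*(1-p))
# Gain with lambda = 0 and gamma = 0
gain = 0.5 * (GL^2/HL+GR^2/HR-G^2/H)
gain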
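I understand that Gain is given by the following formula:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$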
Since we are using log loss, G is the sum of p-y and H is the sum of p(1-p); gamma and lambda in this instance are both zero.
Can anyone identify where I am going wrong?
Recommended answer
OK, I think I've worked it out. The default value of reg_lambda is not 0, as the documentation states, but is actually 1 (from param.h).
Also, it appears that the factor of one half is not applied when the gain is calculated, so the Quality column is double what you would expect. Lastly, I don't think gamma (also called min_split_loss) is applied to this calculation either (from updater_histmaker-inl.hpp).
Instead, gamma is used to determine whether to invoke pruning, but it is not reflected in the gain calculation itself, contrary to what the documentation suggests.
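Putting the three observations above together (lambda equal to 1, no factor of one half, no gamma term), the value reported in the Quality column would take the following form. This is a sketch derived from the answer's description, written in the notation of the question's code (GL, GR, HL, HR), not a formula quoted from the xgboost source:

$$\text{Quality} = \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}, \qquad \lambda = 1$$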
If you apply these changes, you do indeed get 4000.53 as the Quality for node 0 of tree 0, as in the original question. I'll raise this as an issue with the xgboost developers so the documentation can be changed accordingly.
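The changes described above can be sketched in R as follows. The helper split_quality is a hypothetical name introduced here for illustration and is not part of the xgboost package. The toy vectors are made up so that the arithmetic is easy to check by hand. It assumes log loss, so the gradient is p - y and the hessian is p(1 - p).

```r
# Hypothetical helper (not an xgboost function): split score as described in
# the answer, i.e. lambda in the denominators, no 1/2 factor, no gamma term.
split_quality <- function(g, h, is_left, lambda = 1) {
  score <- function(G, H) G^2 / (H + lambda)
  score(sum(g[is_left]), sum(h[is_left])) +
    score(sum(g[!is_left]), sum(h[!is_left])) -
    score(sum(g), sum(h))
}

# Toy data, made up for illustration: four rows, all predictions at 0.5
y <- c(1, 1, 0, 0)
p <- rep(0.5, length(y))
is_left <- c(TRUE, TRUE, FALSE, FALSE)

g <- p - y        # gradient of log loss
h <- p * (1 - p)  # hessian of log loss

q_lambda1 <- split_quality(g, h, is_left, lambda = 1)  # lambda = 1, as found in the answer
q_lambda0 <- split_quality(g, h, is_left, lambda = 0)  # lambda = 0, as assumed in the question
```

With the variables from the question, the equivalent call would be split_quality(p - y, p * (1 - p), X[, 'odor=none'] == 0), which, according to the answer, should come out at about 4000.53.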