Why are there discrepancies in xgboost regression prediction from individual trees?


Question

First I run a very simple xgb regression model which contains only 2 trees with 1 leaf each. Data available here. (I understand this is a classification dataset but I just force the regression to demonstrate the question here):

import numpy as np
from numpy import loadtxt
from xgboost import XGBClassifier,XGBRegressor
from xgboost import plot_tree
import matplotlib.pyplot as plt

plt.rc('figure', figsize=[10,7])


# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit the model on the full dataset (no train/test split)
model = XGBRegressor(max_depth=0, learning_rate=0.1, n_estimators=2,random_state=123)
model.fit(X, y)

Plotting the trees, we see that the 2 trees give a prediction value of -0.0150845 and -0.013578

plot_tree(model, num_trees=0) # 1ST tree, gives -0.0150845
plot_tree(model, num_trees=1) # 2ND tree, gives -0.013578
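
Alternatively, the same leaf values can be read from a text dump of the booster. A small sketch (the exact dump format may differ between xgboost versions):

# dump each tree as text; a single-leaf tree prints as something like "0:leaf=-0.0150845"
for i, tree_str in enumerate(model.get_booster().get_dump()):
    print(f"tree {i}: {tree_str.strip()}")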

But if we run predictions with the 1st tree and both trees, they give reasonable values:

print(X[0])
print(model.predict(X[0,None],ntree_limit=1)) # 1st tree only
print(model.predict(X[0,None],ntree_limit=0)) # ntree_limit=0: use all trees

# output:
#[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
#[0.48491547]
#[0.47133744]

So there are two questions here:

  1. How do the tree predictions of -0.0150845 and -0.013578 relate to the final outputs of 0.48491547 and 0.47133744? Clearly some transformation is going on here.
  2. If each tree has only one leaf, shouldn't the first tree simply predict the sample mean of y (~0.3) in order to minimize the squared error (the default objective of XGBRegressor)?

UPDATE: I figured out Q1: there is a base_score=0.5 default parameter in XGBRegressor which shifts the prediction (it only really makes sense in a binary classification problem). But for Q2, even after I set base_score=0, the first leaf gives a value close to the sample mean of y, but not exactly equal to it. So there is still something missing here.
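
As a quick sanity check on Q1, the two printed predictions can be reconstructed by adding the leaf values read off the tree plots to base_score. A minimal sketch, reusing the numbers shown above:

base_score = 0.5                            # XGBRegressor default
leaf_tree1 = -0.0150845                     # leaf value of the 1st tree (from plot_tree)
leaf_tree2 = -0.013578                      # leaf value of the 2nd tree (from plot_tree)

print(base_score + leaf_tree1)              # ~0.4849155, matches the 1-tree prediction
print(base_score + leaf_tree1 + leaf_tree2) # ~0.4713375, matches the 2-tree prediction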

Answer

This behavior is a characteristic of Gradient Boosted trees. The first tree contains the base predictions for your data, so dropping the first tree will dramatically reduce the performance of the model. Here is the gradient boosting algorithm in outline (a runnable toy version follows below):
1. y_pred = 0, pick a learning_rate (e.g. 0.1)
2. Repeat at train time for each tree i:
i. residual = y - y_pred
ii. fit the i-th tree on (X, residual)
iii. y_pred = y_pred + learning_rate * (i-th tree).predict(X)
3. At test time:
prediction = sum over all trees of learning_rate * (i-th tree).predict(X_test)
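
Below is a minimal, self-contained sketch of that loop in which every "tree" is just a single leaf (a constant), mirroring the single-leaf trees in the question; the function names and structure are illustrative only, not the actual XGBoost implementation:

import numpy as np

def fit_single_leaf_boosting(X, y, n_estimators=2, learning_rate=0.1):
    # Toy gradient boosting where every "tree" is a single leaf,
    # so X is accepted but unused (a single leaf ignores the features).
    y_pred = np.zeros_like(y, dtype=float)      # step 1: start the running prediction at 0
    leaves = []
    for _ in range(n_estimators):
        residual = y - y_pred                   # step 2.i: residuals of the current model
        leaf = residual.mean()                  # step 2.ii: a single leaf just predicts the mean residual
        y_pred = y_pred + learning_rate * leaf  # step 2.iii: shrunken update of the running prediction
        leaves.append(leaf)
    return leaves

def predict_single_leaf_boosting(leaves, n_rows, learning_rate=0.1):
    # step 3: at test time every tree adds its shrunken leaf value
    return np.full(n_rows, sum(learning_rate * leaf for leaf in leaves))

# e.g. with the X, y loaded above:
# leaves = fit_single_leaf_boosting(X, y)
# print(predict_single_leaf_boosting(leaves, len(X)))

Note that in this sketch the first tree only contributes learning_rate * mean(y) rather than mean(y) itself, because the update is shrunk by the learning rate before being added to the running prediction.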

Answer to your first question: the first tree predicts most of your data, while every later tree only tries to reduce the error left by the previous trees. That is why you observe a good prediction using just the first tree but a poor one from the second tree on its own: what you are seeing there is the error term between the two trees, not a standalone prediction.
Answer to your second question: not all frameworks initialize the residual (i.e. the starting prediction) with the mean of your target values; many frameworks simply initialize it to 0.
If you want to visualize Gradient Boosting, here is a good link: a YouTube video walking through the GBDT algorithm.
I hope this helps!

