How can I implement incremental training for xgboost?

Problem description

The problem is that my training data cannot fit into RAM due to its size. So I need a method which first builds one tree on the whole training set, calculates residuals, builds another tree, and so on (like gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in some loop it will not help, because in that case it just rebuilds the whole model for each batch.

Recommended answer

Disclaimer: I'm new to xgboost as well, but I think I figured this out.

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
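
For the asker's original out-of-core use case, this turns into a loop over batches. Here's a minimal sketch (not part of the original answer; load_batches is a hypothetical generator that yields one RAM-sized (X, y) chunk of the training data at a time):

import os
import xgboost as xgb

params = {'objective': 'reg:squarederror'}
model_path = 'incremental_model.json'  # JSON format works across recent XGBoost versions

for X_batch, y_batch in load_batches():  # hypothetical batch generator
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # continue from the previously saved model if one exists,
    # otherwise start a fresh booster
    prev = model_path if os.path.exists(model_path) else None
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=prev)
    booster.save_model(model_path)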

Here's a small experiment that I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit  # the cross_validation module was removed in scikit-learn 0.20
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:squarederror', 'verbosity': 0}  # 'reg:linear' was renamed; 'verbosity' replaces the unrecognized 'verbose' key
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)  # trained from scratch
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')  # continues from model_1

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482
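
One more variation, based on the xgb_model docstring in the training.py file linked below: xgb_model also accepts a live Booster instance, so when both batches are processed in the same session, the round-trip through the filesystem is optional.

# equivalent to model_2_v2, but passing the Booster object directly
model_2_v3 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)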

Let me know if anything is unclear!

Reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
