How can I implement incremental training for xgboost?


Question

The problem is that my training data cannot fit into RAM due to its size. So I need a method that first builds one tree on the whole training set, computes the residuals, builds another tree, and so on (the way gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in some loop, it will not help, because in that case it just rebuilds the whole model for each batch.

Answer

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
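As a rough illustration of how the batch loop could look, here is a minimal sketch. The iter_batches helper and the incremental.json file name are hypothetical placeholders (not part of the original answer); in practice each batch would be read from disk rather than generated.

import numpy as np
import xgboost as xgb

# Hypothetical stand-in for loading chunks from disk; each chunk fits in RAM
def iter_batches(n_batches=5, rows=200, cols=10, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(rows, cols))
        y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=rows)
        yield X, y

params = {'objective': 'reg:squarederror'}
model_path = None  # nothing saved yet before the first batch

for X_batch, y_batch in iter_batches():
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # xgb_model=None trains from scratch; a saved model path continues training
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=model_path)
    booster.save_model('incremental.json')
    model_path = 'incremental.json'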

Here's a small experiment that I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; any regression dataset works here
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:squarederror'}  # formerly spelled 'reg:linear'; same objective
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482
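Note that passing xgb_model continues boosting by adding new trees on top of the saved booster; the existing trees are not re-fit. Each batch still has to fit in memory on its own, so this gives batch-wise continuation rather than true streaming.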

Reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
