How can I implement incremental training for xgboost?

Problem description

The problem is that my training data cannot fit into RAM due to its size. So I need a method which first builds one tree on the whole training set, calculates residuals, builds another tree, and so on (like gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in some loop it will not help, because in that case it just rebuilds the whole model for each batch.

Recommended answer

Disclaimer: I'm new to xgboost as well, but I think I figured this out.

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
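
For the asker's original out-of-core use case, this turns into a loop over batches. Here's a minimal sketch (not part of the original answer; load_batches is a hypothetical generator that yields one RAM-sized (X, y) chunk of the training data at a time):

import os
import xgboost as xgb

params = {'objective': 'reg:squarederror'}
model_path = 'incremental_model.json'  # JSON format works across recent XGBoost versions

for X_batch, y_batch in load_batches():  # hypothetical batch generator
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    # continue from the previously saved model if one exists,
    # otherwise start a fresh booster
    prev = model_path if os.path.exists(model_path) else None
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=prev)
    booster.save_model(model_path)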

Here's a small experiment that I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model with the first half and get a score that will serve as a benchmark. Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit  # the cross_validation module was removed in scikit-learn 0.20
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:squarederror', 'verbosity': 0}  # 'reg:linear' was renamed; 'verbosity' replaces the unrecognized 'verbose' key
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)  # trained from scratch
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')  # continues from model_1

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482
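
One more variation, based on the xgb_model docstring in the training.py file linked below: xgb_model also accepts a live Booster instance, so when both batches are processed in the same session, the round-trip through the filesystem is optional.

# equivalent to model_2_v2, but passing the Booster object directly
model_2_v3 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)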

Let me know if anything is unclear!

Reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
