XGBoost predict_proba slow inference performance
Problem description
I trained two gradient-boosting models on the same data, one with Scikit-learn and one with XGBoost.
Scikit-learn model
GradientBoostingClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbose=2
)
XGBoost model
XGBClassifier(
    n_estimators=5,
    learning_rate=0.17,
    max_depth=5,
    verbosity=2,
    eval_metric="logloss"
)
Then I checked inference performance:
- XGBoost: 9.7 ms ± 84.6 µs per loop
- Scikit-learn: 426 µs ± 12.5 µs per loop
Why is XGBoost so slow?
Recommended answer
"Why is XGBoost so slow?": XGBClassifier() is the scikit-learn API for XGBoost (see e.g. https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier for more details). If you call XGBoost's native functions directly (xgb.train() and Booster.predict() on a DMatrix) rather than through the wrapper, it is faster. To compare the two libraries fairly, it makes sense to call each one directly, instead of calling one directly and the other through a wrapper API. Here is an example:
# benchmark_xgboost_vs_sklearn.py
# Adapted from `xgboost_test.py` by Jacob Schreiber
# (https://gist.github.com/jmschrei/6b447aada61d631544cd)
"""
Benchmarking scripts for XGBoost versus sklearn (time and accuracy)
"""
import time
import random
import numpy as np
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
random.seed(0)
np.random.seed(0)

def make_dataset(n=500, d=10, c=2, z=2):
    """
    Make a dataset with n samples per class, d dimensions and c classes,
    with class means a distance of z apart in each dimension, making each
    feature equally informative.
    """
    # Generate our data and our labels
    X = np.concatenate([np.random.randn(n, d) + z*i for i in range(c)])
    y = np.concatenate([np.ones(n) * i for i in range(c)])
    # Generate a random indexing
    idx = np.arange(n*c)
    np.random.shuffle(idx)
    # Randomize the dataset, preserving data-label pairing
    X = X[idx]
    y = y[idx]
    # Return X_train, X_test, y_train, y_test
    return X[::2], X[1::2], y[::2], y[1::2]

def main():
    """
    Run SKLearn, then xgboost via its scikit-learn API wrapper,
    then xgboost directly via xgb.train()
    """
    # Generate the dataset
    X_train, X_test, y_train, y_test = make_dataset(10, z=100)
    n_estimators = 5
    max_depth = 5
    learning_rate = 0.17
    # sklearn first
    tic = time.time()
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
        max_depth=max_depth, learning_rate=learning_rate)
    clf.fit(X_train, y_train)
    print("SKLearn GBClassifier: {}s".format(time.time() - tic))
    print("Acc: {}".format(clf.score(X_test, y_test)))
    print(y_test.sum())
    print(clf.predict(X_test))
    # Convert the data to DMatrix for xgboost
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Loop through multiple thread numbers for xgboost
    for threads in 1, 2, 4:
        # xgboost's sklearn interface (XGBModel is a base class of XGBClassifier);
        # n_jobs replaces the deprecated nthread argument
        tic = time.time()
        clf = xgb.XGBModel(n_estimators=n_estimators, max_depth=max_depth,
            learning_rate=learning_rate, n_jobs=threads)
        clf.fit(X_train, y_train)
        print("SKLearn XGBoost API Time: {}s".format(time.time() - tic))
        preds = np.round(clf.predict(X_test))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))
        print("{} threads: ".format(threads))
        # xgboost's native interface, with the same settings as above
        tic = time.time()
        param = {
            'max_depth': max_depth,
            'eta': learning_rate,  # same learning rate as the other models
            'verbosity': 0,  # replaces the removed 'silent' parameter
            'objective': 'binary:logistic',
            'nthread': threads
        }
        bst = xgb.train(param, dtrain, n_estimators,
            [(dtest, 'eval'), (dtrain, 'train')])
        print("XGBoost (no wrapper) Time: {}s".format(time.time() - tic))
        preds = np.round(bst.predict(dtest))
        acc = 1. - (np.abs(preds - y_test).sum() / y_test.shape[0])
        print("Acc: {}".format(acc))


if __name__ == '__main__':
    main()
Summary of results (each time covers model construction and training):
sklearn.ensemble.GradientBoostingClassifier()
- Time: 0.003237009048461914s
- Accuracy: 1.0
sklearn xgboost API wrapper XGBClassifier()
- Time: 0.3436141014099121s
- Accuracy: 1.0
XGBoost (no wrapper) xgb.train()
- Time: 0.0028612613677978516s
- Accuracy: 1.0
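The script times each call once with time.time(), while the question quotes repeated-run figures such as "9.7 ms ± 84.6 µs per loop". Below is a minimal sketch of how such mean ± standard deviation figures could be collected with only the standard library. The helper names (time_per_call, format_seconds, report) are hypothetical and not part of XGBoost or scikit-learn, and the workload in the demo is a stand-in rather than a real model call.

```python
import statistics
import timeit


def time_per_call(fn, number=100, repeat=7):
    """Return (mean, stdev) in seconds per call of fn() over `repeat` runs."""
    # Each run executes fn() `number` times; dividing gives the per-call time.
    runs = timeit.repeat(fn, number=number, repeat=repeat)
    per_call = [total / number for total in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)


def format_seconds(seconds):
    """Format a duration with a unit suited to its magnitude."""
    for unit, scale in (("s", 1.0), ("ms", 1e-3), ("µs", 1e-6)):
        if seconds >= scale:
            return "{:.3g} {}".format(seconds / scale, unit)
    return "{:.3g} ns".format(seconds / 1e-9)


def report(label, fn, number=100, repeat=7):
    """Print and return a one-line timing summary for fn()."""
    mean, std = time_per_call(fn, number=number, repeat=repeat)
    line = "{}: {} ± {} per call".format(
        label, format_seconds(mean), format_seconds(std))
    print(line)
    return line


if __name__ == "__main__":
    # Stand-in workload (an assumption for the demo); swap in a model call.
    report("sum of squares", lambda: sum(i * i for i in range(1000)))
```

To compare inference paths, the same helper could be pointed at each prediction call in turn, for example report("wrapper", lambda: clf.predict(X_test)) and report("native", lambda: bst.predict(dtest)), using the objects from the script above. Measuring each path over many repetitions avoids drawing conclusions from a single noisy run.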