xgboost predict method returns the same predicted value for all rows


Question

I've created an xgboost classifier in Python:

train is a pandas DataFrame with 100k rows and 50 features as columns; target is a pandas Series.

import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)

predictions = xgb_classifier.predict(test)

However, after training, when I use this classifier to predict values, the entire results array is the same number. Any idea why this would be happening?
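The symptom is easy to check mechanically before digging into causes. A minimal sketch (numpy only; the two arrays are hypothetical stand-ins for a real model's prediction output):

```python
import numpy as np

# Hypothetical prediction arrays illustrating the symptom.
healthy = np.array([0.3, 1.2, 0.7, 2.1])
degenerate = np.full(4, 0.88)   # every row gets the same value

def is_constant(preds, tol=1e-9):
    """True when the model predicts (almost) the same value for all rows."""
    return np.ptp(preds) < tol  # peak-to-peak: max - min

print(is_constant(healthy))     # varied predictions: False
print(is_constant(degenerate))  # the symptom in the question: True
```

If the check returns True, the model has collapsed to a single output value, which is what the answers below try to explain.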

Data clarification: ~50 numerical features with a numerical target.

I've also tried RandomForestRegressor from sklearn with the same data, and it does give realistic predictions. Perhaps this is a legitimate bug in the xgboost implementation?

Accepted answer

This question has received several responses, including on this thread as well as here and here.

I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.

I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7,000 columns) because I did not have enough local memory for the algorithm. It turned out that, for me, the array of predicted values was just an array of the average value of the target variable. This suggested that the model was underfitting. One solution to an underfitting model is to train it on more data, so I tried my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values.

On the other hand, the issue could simply have been that the slice of predicted values I was looking at was predicted from training data with very little information (e.g. 0's and NaN's). For training data with very little information, predicting the average value of the target feature seems reasonable.
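The "array of average target values" observation above can be verified directly. A small sketch with made-up numbers standing in for the real target and predictions:

```python
import numpy as np

# Hypothetical target and degenerate predictions: an underfit model
# often falls back to predicting the target mean for every row.
target = np.array([1.0, 3.0, 5.0, 7.0])          # mean = 4.0
predictions = np.full_like(target, 4.0)

# If every prediction sits at the target mean, suspect underfitting
# (too little data or too little signal), as described above.
underfit_signature = np.allclose(predictions, target.mean())
print(underfit_signature)  # True
```

If this check passes on real data, training on a larger sample (as the answer describes) is a sensible next step.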

None of the other suggested solutions I came across were helpful for me. To summarize, the suggested solutions included:

1) check whether gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small
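As a rough illustration of those three suggestions (the parameter names are standard XGBoost ones, but the values and feature names are invented for illustration):

```python
# 1) and 3): hypothetical parameter adjustments. A large gamma prunes
# aggressively and a tiny max_depth limits splits; either can force
# near-constant output. The values below are illustrative only.
params_before = {"gamma": 10.0, "max_depth": 2}
params_after  = {"gamma": 0.0,  "max_depth": 6}

# 2) is about data hygiene, not parameters: the target column must not
# appear among the training features (hypothetical column names).
features = ["f0", "f1", "target"]
leak_free = [c for c in features if c != "target"]
print(leak_free)  # ['f0', 'f1']
```

These are starting points to experiment with, not guaranteed fixes; as noted above, none of them resolved the issue for this particular answerer.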
