xgboost predict method returns the same predicted value for all rows


Problem Description

I've created an xgboost classifier in Python:

train is a pandas DataFrame with 100k rows and 50 feature columns; target is a pandas Series.

import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)

predictions = xgb_classifier.predict(test)

However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?

Data clarification: ~50 numerical features with a numerical target

I've also tried RandomForest Regression from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?

Recommended Answer

This question has received several responses, including on this thread as well as elsewhere.

I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.

I was training on a local machine using a random sample (~0.5%) of a large, sparse dataset (200,000 rows and 7,000 columns) because I did not have enough local memory for the algorithm. It turned out that, for me, the array of predicted values was just an array of the average value of the target variable. This suggested that the model was underfitting. One solution to an underfitting model is to train it on more data, so I reran the analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. Alternatively, the issue may simply have been that the slice of predicted values I was looking at was predicted from training rows carrying very little information (e.g. 0s and NaNs). For training data with very little information, predicting the average value of the target feature seems reasonable.
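A quick way to spot this failure mode, assuming your predictions and target are array-like, is to check whether the prediction array is constant and, if so, whether that constant is just the target mean (a minimal numpy sketch; the function name is my own):

```python
import numpy as np

def predictions_collapsed(preds, target, tol=1e-6):
    """Return (is_constant, equals_target_mean) for a prediction array."""
    preds = np.asarray(preds, dtype=float)
    # np.ptp is max - min; ~0 means every prediction is the same value
    constant = bool(np.ptp(preds) < tol)
    at_mean = bool(constant and abs(float(preds[0]) - float(np.mean(target))) < tol)
    return constant, at_mean

target = np.array([0.0, 0.3, 0.5, 0.6, 0.7])   # mean = 0.42

# Collapsed output: every row predicted as the target mean.
bad_preds = np.full(5, 0.42)
print(predictions_collapsed(bad_preds, target))   # (True, True)

# Healthy output: predictions vary across rows.
good_preds = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
print(predictions_collapsed(good_preds, target))  # (False, False)
```

If the first flag is True but the second is False, the model is outputting a constant that is not the target mean, which points at the objective or label encoding rather than underfitting.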

None of the other suggested solutions I came across helped in my case. To summarize, the suggested solutions included: 1) check whether gamma is too high; 2) make sure your target labels are not included in your training dataset; 3) max_depth may be too small.
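The checklist above can be sketched as a parameter re-check. The parameter names are the standard xgboost ones, but the specific values here are assumed starting points, not tuned settings:

```python
# Suggestions 1 and 3 are hyperparameters to revisit.
params = {
    "max_depth": 6,   # suggestion 3: the original max_depth=3 may be too shallow
    "gamma": 0,       # suggestion 1: a high gamma prunes every split, so each
                      # tree degenerates to a single leaf (constant output)
}

# Suggestion 2 is a data check rather than a parameter: verify the target
# column is not among the training features (hypothetical column names).
feature_cols = ["f1", "f2", "f3"]
target_col = "target"
assert target_col not in feature_cols

print(params)
```

These values would be passed to `xgb.XGBClassifier(**params)` as in the question's code; verifying suggestion 2 against your actual DataFrame columns is the cheapest check of the three.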
