High AUC but bad predictions with imbalanced data


Problem Description


I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:

Class

0    0.970691
1    0.029309


The params I used and the code for training are shown below.

import lightgbm as lgb
import numpy as np

lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric':'auc',
        'learning_rate': 0.1,
        'is_unbalance': 'true',  # because training data is unbalanced (can be replaced with scale_pos_weight)
        'num_leaves': 31,  # we should let it be smaller than 2^(max_depth)
        'max_depth': 6, # -1 means no limit
        'subsample' : 0.78
    }

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10, 
                    verbose_eval=10, early_stopping_rounds=40)

nround = int(np.argmax(cv_results['auc-mean'])) + 1  # argmax is 0-based, so add 1 to get the round count
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)


preds = model.predict(test_feats)

preds = [1 if x >= 0.5 else 0 for x in preds]


I ran CV to get the best model and best round. I got 0.994 AUC on CV and a similar score on the validation set.


But when I am predicting on the test set I am getting very bad results. I am sure that the train set is sampled perfectly.

What parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?

Recommended Answer


The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in

preds = [1 if x >= 0.5 else 0 for x in preds]

This should not be the case here.


This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...

From Cross Validated (emphasis added):


Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater than 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
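As a concrete sketch of that advice, the snippet below uses scikit-learn's `roc_curve` to pick the threshold that maximizes Youden's J statistic (tpr − fpr) on a held-out set. The labels and probabilities here are purely illustrative stand-ins for the poster's validation data, not values from the original post:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative validation labels and predicted probabilities (imbalanced: 2 positives)
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
val_preds = np.array([0.02, 0.10, 0.05, 0.30, 0.01, 0.15, 0.08, 0.60, 0.45, 0.20])

fpr, tpr, thresholds = roc_curve(y_val, val_preds)

# Youden's J (tpr - fpr) is one common criterion for choosing a cut-off
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]

hard_preds = (val_preds >= best_threshold).astype(int)
print(best_threshold, hard_preds)
```

With this toy data the chosen cut-off is 0.45, far below the default 0.5 in spirit for a 97:3 problem, where the optimal threshold is typically much lower.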

From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:

2.2. How to set the classification threshold on the testing set


Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
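One practical way to act on that observation is to scan thresholds against a precision-recall trade-off. This sketch (with hypothetical data and variable names, not taken from the paper or the post) picks the threshold that maximizes F1 on a validation set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative validation labels/probabilities (imbalanced: 2 positives out of 10)
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
val_preds = np.array([0.10, 0.20, 0.10, 0.30, 0.20, 0.10, 0.25, 0.35, 0.40, 0.70])

precision, recall, thresholds = precision_recall_curve(y_val, val_preds)

# The final (precision=1, recall=0) point has no threshold, so drop it;
# the small epsilon guards against 0/0 when precision and recall are both 0.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print(best_threshold)
```

For heavily imbalanced data, precision-recall curves are often more informative than ROC curves, since they ignore the flood of true negatives.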


The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.


Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...

On a more general level, regarding the role of the threshold itself in the classification process (which, in my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the links provided there) at Cross Validated; the key point:


the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
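That decision component can be made explicit with misclassification costs. Under standard decision theory, if a false positive costs `cost_fp` and a false negative costs `cost_fn`, the expected-cost-minimizing cut-off is `cost_fp / (cost_fp + cost_fn)`. The costs below are purely illustrative, not from the thread:

```python
# Illustrative assumption: missing a positive (false negative) is 10x
# worse than a false alarm (false positive).
cost_fp = 1.0
cost_fn = 10.0

# Bayes-optimal rule: predict 1 when p >= cost_fp / (cost_fp + cost_fn)
threshold = cost_fp / (cost_fp + cost_fn)

probs = [0.05, 0.12, 0.50, 0.91]  # hypothetical model outputs
hard = [1 if p >= threshold else 0 for p in probs]
print(threshold, hard)
```

With these costs the cut-off is about 0.09, so a sample with probability 0.12 is classified positive even though it would be negative under the default 0.5 threshold.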
