Dealing with the class imbalance in binary classification

Question

Here's a brief description of my problem:

  1. I am working on a supervised learning task to train a binary classifier.
  2. I have a dataset with a large class imbalance: 8 negative instances for every positive one.
  3. I use the F-measure, i.e. the harmonic mean of precision and recall (sensitivity), to assess the performance of a classifier.

I plotted the ROC curves of several classifiers and they all show a high AUC, which suggests that the classification is good. However, when I test the classifiers and compute the F-measure, I get a really low value. I know that this issue is caused by the class skew of the dataset and, so far, I have found two options to deal with it:

  1. Adopting a cost-sensitive approach by assigning weights to the dataset's instances (see this post)
  2. Thresholding the predicted probabilities returned by the classifiers, to reduce the number of false positives and false negatives.

I went for the first option and that solved my issue (the F-measure is now satisfactory). But now my question is: which of these methods is preferable, and what are the differences?

PS: I am using Python with the scikit-learn library.

Recommended answer

Both weighting and thresholding are valid forms of cost-sensitive learning. In the briefest terms, you can think of the two as follows:

Weighting

Essentially, one is asserting that the 'cost' of misclassifying the rare class is higher than that of misclassifying the common class. This is applied at the algorithmic level in such algorithms as SVM, ANN, and Random Forest. The limitation here is whether the algorithm can deal with weights. Furthermore, many applications of this approach are motivated by one kind of misclassification being more serious than the other (e.g. classifying someone who has pancreatic cancer as not having cancer). In such circumstances, you know why you want to make sure you classify specific classes correctly even in imbalanced settings. Ideally, you want to optimize the cost parameters as you would any other model parameter.
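As a rough sketch of the weighting idea in scikit-learn, the positive-class weight can be tuned like any other hyperparameter. The synthetic dataset, the choice of `LogisticRegression`, and the grid of candidate weights are assumptions made purely for illustration; they do not come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with roughly 8 negatives per positive (illustrative assumption).
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Candidate costs: class 0 keeps weight 1, class 1 gets a heavier weight.
candidate_weights = (1, 2, 4, 8, 16)
param_grid = {"class_weight": [{0: 1, 1: w} for w in candidate_weights]}

# Cross-validated search that selects the weight maximising F1.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_train, y_train)

best_weights = search.best_params_["class_weight"]
test_f1 = f1_score(y_test, search.predict(X_test))
print("selected class weights:", best_weights)
print("test F1:", round(test_f1, 3))
```

Many scikit-learn estimators also accept `class_weight="balanced"`, which sets the weights inversely proportional to the class frequencies and can serve as a starting point before tuning.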

Thresholding

If the algorithm returns probabilities (or some other score), thresholding can be applied after a model has been built. Essentially, you change the classification threshold from the default of 0.5 to an appropriate trade-off level. This can typically be optimized by generating a curve of the evaluation metric (e.g. the F-measure) against the threshold. The limitation here is that you are making absolute trade-offs: any modification of the cutoff that improves predictions for one class will in turn decrease the accuracy of predicting the other class. If you have exceedingly high probabilities for the majority of your common-class instances (e.g. most above 0.85), you are more likely to have success with this method. It is also algorithm independent (provided the algorithm returns probabilities).
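A minimal sketch of threshold selection with scikit-learn follows, assuming a synthetic dataset and a `LogisticRegression` model chosen only for illustration. It computes F1 at every candidate threshold returned by `precision_recall_curve` and keeps the best one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 8 negatives per positive (illustrative assumption).
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]  # estimated P(class 1)

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
p, r = precision[:-1], recall[:-1]
f1_curve = 2 * p * r / np.maximum(p + r, 1e-12)
best_threshold = thresholds[np.argmax(f1_curve)]

f1_default = f1_score(y_val, (scores >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (scores >= best_threshold).astype(int))
print("best threshold:", round(float(best_threshold), 3))
print("F1 at 0.5:", round(f1_default, 3), "| F1 at best threshold:", round(f1_tuned, 3))
```

Because the threshold is chosen on the validation split, the tuned F1 on that same split is optimistic. A separate test split is needed for an honest estimate.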

Sampling is another common option applied to imbalanced datasets to bring some balance to the class distributions. There are essentially two fundamental approaches.

Under-sampling

Extract a smaller subset of the majority instances and keep all the minority instances. This results in a smaller dataset in which the class distribution is closer to balanced; however, you have discarded data that may have been valuable. On the other hand, the reduction in size can itself be beneficial if you have a very large amount of data.
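Random under-sampling can be sketched in a few lines of NumPy. The helper name `undersample_majority` and its `ratio` parameter are hypothetical, introduced here only for illustration.

```python
import numpy as np


def undersample_majority(X, y, majority_label=0, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of majority rows.

    `ratio` is the desired number of majority rows per minority row.
    """
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


# Toy data: 800 negatives and 100 positives.
y = np.array([0] * 800 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_small, y_small = undersample_majority(X, y)
print(np.bincount(y_small))  # [100 100]
```

Only the training split should be resampled; the evaluation data should keep the original class distribution.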

Over-sampling

Increase the number of minority instances by replicating them. This will result in a larger dataset which retains all the original data but may introduce bias. As you increase the size, however, you may begin to impact computational performance as well.
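Replication-based over-sampling might look like the following sketch, which uses `sklearn.utils.resample` to draw minority rows with replacement until the classes are balanced. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 800 negatives and 100 positives.
y = np.array([0] * 800 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_min, y_min = X[y == 1], y[y == 1]
n_majority = int(np.sum(y == 0))

# Draw extra minority rows with replacement to match the majority count.
X_extra, y_extra = resample(
    X_min, y_min, replace=True, n_samples=n_majority - len(y_min), random_state=0
)

# All original rows are retained; the duplicates are appended.
X_over = np.vstack([X, X_extra])
y_over = np.concatenate([y, y_extra])
print(np.bincount(y_over))  # [800 800]
```

Over-sampling should be done after the train/test split and on the training portion only; otherwise copies of the same instance can end up on both sides of the split and inflate the scores.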

Advanced Methods

There are additional methods that are more ‘sophisticated’ to help address potential bias. These include methods such as SMOTE, SMOTEBoost and EasyEnsemble as referenced in this prior question regarding imbalanced datasets and CSL.
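To give a feel for how SMOTE differs from plain replication, here is a simplified sketch of its core idea: each synthetic point is an interpolation between a minority instance and one of its nearest minority neighbours. The function `smote_like` is a hypothetical illustration, not the reference algorithm or the implementation found in any library, and it ignores details such as categorical features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbours = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_synthetic)
    # Columns 1..k hold the k nearest other points.
    picked = neighbours[base, rng.integers(1, k + 1, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))  # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy minority class: 50 points in two dimensions.
X_minority = np.random.default_rng(1).normal(size=(50, 2))
X_synthetic = smote_like(X_minority, n_synthetic=200)
print(X_synthetic.shape)  # (200, 2)
```

Because every synthetic point lies on a segment between two existing minority points, the new data stays within the region already covered by the minority class instead of stacking exact copies.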

One further note regarding building models with imbalanced data is that you should keep your model metric in mind. For example, metrics such as the F-measure do not take the true negatives into account. Therefore, it is often recommended to use metrics such as Cohen's kappa in imbalanced settings.
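The point about true negatives can be checked with a small constructed example. The two confusion matrices below are made up for illustration: they share the same true positives, false positives and false negatives and differ only in the number of true negatives.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score


def labels_from_counts(tp, fp, fn, tn):
    """Build label arrays that reproduce the given confusion-matrix counts."""
    y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
    y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)
    return y_true, y_pred


# Same TP, FP and FN; only the number of true negatives changes.
y_true_a, y_pred_a = labels_from_counts(tp=50, fp=50, fn=50, tn=50)
y_true_b, y_pred_b = labels_from_counts(tp=50, fp=50, fn=50, tn=750)

f1_a = f1_score(y_true_a, y_pred_a)
f1_b = f1_score(y_true_b, y_pred_b)
kappa_a = cohen_kappa_score(y_true_a, y_pred_a)
kappa_b = cohen_kappa_score(y_true_b, y_pred_b)

print("F1:   ", round(f1_a, 4), round(f1_b, 4))        # 0.5 0.5
print("kappa:", round(kappa_a, 4), round(kappa_b, 4))  # 0.0 0.4375
```

The F-measure is identical in both cases, while kappa moves from chance-level agreement to moderate agreement once the additional true negatives are counted.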
