Upsampling an imbalanced dataset's minority classes


Problem description

I am using scikit-learn to classify my data; at the moment I am running a simple DecisionTree classifier. I have three classes with a big imbalance problem. The classes are 0, 1 and 2, and the minority classes are 1 and 2.

To give you an idea of the number of samples in each class:

0 = 25,000 samples
1 = roughly 15-20 samples
2 = roughly 15-20 samples

So the minority classes each make up about 0.06% of the dataset. The approach I am following to solve the imbalance problem is UPSAMPLING of the minority classes. Code:

    from sklearn.utils import resample
    resample(data, replace=True, n_samples=len_major_class, random_state=1234)
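For context, a fuller version of that upsampling step might look like the sketch below, which resamples each minority class up to the majority-class size and recombines everything. It assumes data is a pandas DataFrame with a 'label' column; the variable and column names are illustrative, not taken from the original code:

    import pandas as pd
    from sklearn.utils import resample

    majority = data[data['label'] == 0]          # ~25,000 samples
    len_major_class = len(majority)

    parts = [majority]
    for cls in (1, 2):                           # the two minority classes
        minority = data[data['label'] == cls]
        parts.append(resample(minority,
                              replace=True,               # sample with replacement
                              n_samples=len_major_class,  # match the majority size
                              random_state=1234))

    upsampled = pd.concat(parts)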

Now comes the problem. I did two tests:

  1. If I upsample the minority classes and then divide my dataset into two groups, one for training and one for testing... the results are:

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       1.00      1.00      1.00     20533
          2       1.00      1.00      1.00     20439

avg / total       1.00      1.00      1.00     61542

A very good result.

  2. If I ONLY upsample the training data and leave the original data for testing, the result is:

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       0.00      0.00      0.00        15
          2       0.00      0.00      0.00        16

avg / total       1.00      1.00      1.00     20601

As you can see, the global accuracy is high, but the precision and recall for classes 1 and 2 are zero.

I am creating the classifier in this way:

    from sklearn.tree import DecisionTreeClassifier

    DecisionTreeClassifier(max_depth=20, max_features=0.4, random_state=1234, criterion='entropy')
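The precision/recall tables above have the format of scikit-learn's classification_report; a hypothetical end-to-end evaluation along those lines (X_train, y_train, X_test, y_test are assumed to come from the split described above, not from the original post) might be:

    from sklearn.metrics import classification_report
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=20, max_features=0.4,
                                 random_state=1234, criterion='entropy')
    clf.fit(X_train, y_train)                       # assumed training split
    print(classification_report(y_test, clf.predict(X_test)))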

I have also tried setting class_weight to 'balanced', but it makes no difference.

I should only upsample the training data, so why am I getting this strange result?

Solution

It is quite normal to see this behavior when you do the resampling before the splitting: you are introducing a bias into your data.

If you oversample the data and then split, the minority samples in the test set are no longer independent of the samples in the training set, because they were generated together. In your case they are exact copies of samples in the training set. Your accuracy is 100% because the classifier is classifying samples it has already seen during training.
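To make the ordering concrete, a minimal sketch of the leak-free workflow (split first, then upsample only the training fold) follows; upsample_minorities is a hypothetical helper standing in for the resample loop sketched earlier, and data and 'label' are the same assumed names:

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(data, test_size=0.2,
                                   stratify=data['label'], random_state=1234)
    train_upsampled = upsample_minorities(train)  # hypothetical helper: the resample loop above
    # `test` keeps its original, untouched minority samples, so the
    # evaluation measures generalization rather than memorization.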

Since your problem is strongly imbalanced, I would suggest using an ensemble of classifiers to handle it (see the sketch after this list):

  1. Split your dataset into a training set and a test set. Given the size of the dataset, you can sample 1-2 examples from each minority class for testing and leave the others for training.
  2. From the training set, generate N datasets, each containing all the remaining minority-class samples and an under-sample of the majority class (I would say about 2x the number of minority-class samples).
  3. For each of the datasets obtained, train a model.
  4. Use the test set to obtain predictions; the final prediction is the result of a majority vote over all the classifiers' predictions.
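A rough sketch of that ensemble (essentially an under-bagging scheme) is given below; train, test and the 'label' column are the assumed names from the earlier sketches, and N = 11 is an arbitrary odd choice to avoid vote ties, not a value from the answer:

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    N = 11                                       # arbitrary odd ensemble size
    minority = train[train['label'] != 0]        # all remaining minority samples
    majority = train[train['label'] == 0]

    models = []
    for i in range(N):
        # Under-sample the majority class to ~2x the minority count.
        major_sub = resample(majority, replace=False,
                             n_samples=2 * len(minority), random_state=i)
        subset = pd.concat([minority, major_sub])
        X, y = subset.drop(columns='label'), subset['label']
        models.append(DecisionTreeClassifier(criterion='entropy',
                                             random_state=i).fit(X, y))

    # Majority vote over the N classifiers' predictions on the test set.
    X_test = test.drop(columns='label')
    votes = np.stack([m.predict(X_test) for m in models])
    final_pred = pd.DataFrame(votes).mode(axis=0).iloc[0].to_numpy()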

To obtain robust metrics, perform several iterations with different initial test/training splits.
