Process for oversampling data for imbalanced binary classification


Question

I have about a 30% and 70% for class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance out the classes to become a 50-50 split. I was wondering if oversampling should be done before or after splitting my data into train and test sets. I have generally seen it done before splitting in online examples, like this:

import pandas as pd

# Separate the two classes, resample one with replacement, and recombine
df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
df_class1_over = df_class1.sample(len(df_class0), replace=True)
df_over = pd.concat([df_class0, df_class1_over], axis=0)

However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data. I am fine doing this, but I would like to know what is considered good practice. Thank you!

Answer


I was wondering if oversampling should be done before or after splitting my data into train and test sets.

It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test ones; see also my related answer here.
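A minimal sketch of this order, using hypothetical toy data and the column name `predict_var` from the question's snippet (the 30/70 split and all other specifics here are illustrative assumptions, not part of the answer):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the question's 30/70 class imbalance
df = pd.DataFrame({
    "feature": range(100),
    "predict_var": [0] * 30 + [1] * 70,
})

# 1. Split first, so the test set is never touched by resampling
train, test = train_test_split(
    df, test_size=0.2, stratify=df["predict_var"], random_state=42
)

# 2. Oversample the minority class (class 0) within the training set only
df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
df_class0_over = df_class0.sample(len(df_class1), replace=True, random_state=42)
train_over = pd.concat([df_class1, df_class0_over], axis=0)

# train_over is now balanced 50-50; test keeps the original 30/70 distribution
```

Because the resampling happens strictly inside `train`, no row of `test` can be a duplicate of a training row.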


I have generally seen it done before splitting in online examples, like this

From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is here: if it is the product of a train-test split, then the oversampling takes place after splitting indeed, as it should be.


However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.

Exactly, this is the reason why the oversampling should be done after splitting to train-test, and not before.
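To see why, here is a toy illustration (not part of the original answer; the data and names are hypothetical) of what goes wrong in the reverse order, oversampling first and splitting second:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 10 minority rows, 90 majority rows
df = pd.DataFrame({
    "feature": range(100),
    "predict_var": [0] * 10 + [1] * 90,
})

# Wrong order: oversample the minority class first...
minority = df[df.predict_var == 0]
majority = df[df.predict_var == 1]
df_over = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)

# ...then split. Copies of the same original rows now straddle the boundary,
# since sampling with replacement preserves the original index labels.
train, test = train_test_split(df_over, test_size=0.2, random_state=0)
leaked = set(train.index) & set(test.index)  # rows present on both sides
```

Any index in `leaked` corresponds to an original row the model both trained on and is evaluated on, which inflates the test score.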

(I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...).


I am fine doing this

You shouldn't :)
