Process for oversampling data for imbalanced binary classification


Question

My data is split about 30% / 70% between class 0 (the minority class) and class 1 (the majority class). Since I do not have a lot of data, I plan to oversample the minority class to balance the classes into a 50-50 split. I was wondering whether oversampling should be done before or after splitting my data into train and test sets. In online examples I have generally seen it done before splitting, like this:

import pandas as pd

df_class0 = train[train.predict_var == 0]  # minority class
df_class1 = train[train.predict_var == 1]  # majority class
# Resample class 1 to the size of class 0, sampling with replacement
df_class1_over = df_class1.sample(len(df_class0), replace=True)
df_over = pd.concat([df_class0, df_class1_over], axis=0)

However, wouldn't that mean that the test data will likely contain duplicated samples from the training set (because we have oversampled the training set)? That would mean testing performance is not necessarily measured on new, unseen data. I am fine doing this, but I would like to know what is considered good practice. Thank you!

Answer

I was wondering if oversampling should be done before or after splitting my data into train and test sets.

It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test sets; see also my related answer here.
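The recommended order can be sketched as follows (a minimal sketch on synthetic data, assuming pandas and scikit-learn are available; the column name predict_var follows the question's snippet, and the data itself is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ~30/70 imbalanced dataset, mirroring the question's setup.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "predict_var": (rng.random(1000) < 0.7).astype(int),  # ~70% class 1
})

# 1) Split FIRST, stratifying so both sets keep the original class ratio.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["predict_var"], random_state=0
)

# 2) Oversample the minority class only inside the training set;
#    the test set is never touched and stays genuinely unseen.
minority = train[train.predict_var == 0]
majority = train[train.predict_var == 1]
minority_over = minority.sample(len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_over], axis=0)
```

After this, train_balanced is 50-50 while test keeps the original imbalance, which is what you want for an honest evaluation.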

I have generally seen it done before splitting in online examples, like this

From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is: if it is the product of a train-test split, then the oversampling does indeed take place after splitting, as it should.

However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.

Exactly; this is why the oversampling should be done after the train-test split, and not before.

(I once witnessed a case where the modeller was struggling to understand why he was getting ~100% test accuracy, much higher than his training accuracy; it turned out his initial dataset was full of duplicates (no class imbalance in that case, but the idea is similar), and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...)
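The leakage described above is easy to demonstrate (a minimal sketch on synthetic data; the column names and sizes are illustrative, not from the question): oversampling before the split copies rows, and some copies inevitably land on both sides of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ~30/70 imbalanced dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature": rng.normal(size=200),
    "label": (rng.random(200) < 0.7).astype(int),
})

# Oversampling BEFORE the split duplicates minority rows...
minority = df[df.label == 0]
majority = df[df.label == 1]
df_over = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=1)]
)

train, test = train_test_split(df_over, test_size=0.3, random_state=1)

# ...so some test rows are exact copies of train rows: data leakage.
leaked = set(test.index) & set(train.index)
```

Any index appearing in both sets is a row the model has already seen during training, which inflates the test score exactly as in the anecdote above.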

I am fine doing this

Well, you shouldn't :)

