Duplicating training examples to handle class imbalance in a pandas data frame

Problem description

I have a DataFrame in pandas that contains training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0

I generated it using:

import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class':    np.random.binomial(2, 0.1, size=number_of_samples), 
    },columns=['feature1','feature2','class'])

print(frame)

As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do so?

Ideally, I would like a solution that generalizes to a multiclass setting (i.e., the integer in the class column may be greater than 1).

Recommended answer

You can find the size of the largest class with:

max_size = frame['class'].value_counts().max()

In your example, this equals 8. For each group, you can sample max_size - len(group) rows with replacement. If you then concatenate these samples with the original DataFrame, every class ends up with the same number of rows and all the original rows are kept.

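# Start from the original frame so that every original row is kept
lst = [frame]
for class_index, group in frame.groupby('class'):
    # Draw just enough extra rows (with replacement) to reach max_size
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)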
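
As a quick check that this approach also covers the multiclass case, here is a minimal self-contained sketch. The function name oversample_to_max, the label_col and random_state parameters, and the three-class demo frame are all made up for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=0):
    """Duplicate rows of smaller classes until each class matches the largest.

    Hypothetical helper wrapping the loop from the answer above.
    """
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # Sampling with replacement lets a tiny class be repeated many times
            parts.append(
                group.sample(n_extra, replace=True, random_state=random_state)
            )
    return pd.concat(parts, ignore_index=True)


# Made-up three-class frame: 6 rows of class 0, 3 of class 1, 1 of class 2
demo = pd.DataFrame({
    "feature1": np.arange(10, dtype=float),
    "class": [0, 0, 0, 0, 0, 0, 1, 1, 1, 2],
})
balanced = oversample_to_max(demo)
counts = balanced["class"].value_counts().sort_index()
print(counts)
```

Passing ignore_index=True to pd.concat is a choice made here so that the duplicated rows do not share index labels with the originals; drop it if you need to trace duplicates back to their source rows.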

You can play with max_size - len(group) and maybe add some noise to it, since as written every group ends up exactly the same size.
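
One possible reading of "add some noise" is to perturb the number of extra rows drawn per class, so the classes end up only approximately equal in size. The sketch below assumes that interpretation; the function name oversample_with_jitter and the jitter and seed parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def oversample_with_jitter(frame, label_col="class", jitter=1, seed=0):
    """Oversample each class to roughly the size of the largest class.

    Hypothetical variant: the number of extra rows per class is shifted by a
    random integer in [-jitter, jitter], and never goes below zero.
    """
    rng = np.random.default_rng(seed)
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # original rows are always kept
    for _, group in frame.groupby(label_col):
        noise = int(rng.integers(-jitter, jitter + 1))
        n_extra = max(max_size - len(group) + noise, 0)
        if n_extra > 0:
            # Derive a per-group integer seed so the whole run is reproducible
            group_seed = int(rng.integers(0, 2**31 - 1))
            parts.append(
                group.sample(n_extra, replace=True, random_state=group_seed)
            )
    return pd.concat(parts, ignore_index=True)


# Made-up three-class frame: 6 rows of class 0, 3 of class 1, 1 of class 2
demo = pd.DataFrame({
    "feature1": np.arange(10, dtype=float),
    "class": [0, 0, 0, 0, 0, 0, 1, 1, 1, 2],
})
jittered = oversample_with_jitter(demo, jitter=1)
jittered_counts = jittered["class"].value_counts().sort_index()
print(jittered_counts)
```

With jitter=1 and a largest class of 6 rows, each class ends up with between 5 and 7 rows.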
