Duplicating training examples to handle class imbalance in a pandas data frame
Question
I have a DataFrame in pandas that contains training examples, for example:
feature1 feature2 class
0 0.548814 0.791725 1
1 0.715189 0.528895 0
2 0.602763 0.568045 0
3 0.544883 0.925597 0
4 0.423655 0.071036 0
5 0.645894 0.087129 0
6 0.437587 0.020218 0
7 0.891773 0.832620 1
8 0.963663 0.778157 0
9 0.383442 0.870012 0
which I generated using:
import pandas as pd
import numpy as np
np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
}, columns=['feature1', 'feature2', 'class'])
print(frame)
As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do so?
Ideally I would like a solution that may generalize to a multiclass setting (i.e., the integer in the class column may be more than 1).
Recommended answer
You can find the size of the largest class with:
max_size = frame['class'].value_counts().max()
In your example, this equals 8. For each group, you can sample max_size - len(group) elements with replacement. If you then concatenate these samples with the original DataFrame, every class ends up the same size and all of the original rows are kept.
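# Start from the original frame so every original row is kept.
lst = [frame]
for class_index, group in frame.groupby('class'):
    # Draw just enough duplicates to bring this class up to max_size.
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)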
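As a minimal sketch of how the loop above carries over to any number of classes, the same steps can be wrapped in a function. The name oversample_to_max, its label_col and random_state parameters, and the small demo frame are assumptions made for illustration; none of them appear in the original answer.

```python
import numpy as np
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=0):
    """Duplicate rows of the smaller classes until each matches the largest one.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:  # the largest class needs no duplicates
            parts.append(
                group.sample(n_extra, replace=True, random_state=random_state)
            )
    return pd.concat(parts, ignore_index=True)


# Demo frame with the same 8-versus-2 class split as in the question.
demo = pd.DataFrame({
    "feature1": np.arange(10) / 10.0,
    "feature2": np.arange(10)[::-1] / 10.0,
    "class": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

balanced = oversample_to_max(demo)
counts = balanced["class"].value_counts()
print(counts)  # both classes now have 8 rows
```

Passing ignore_index=True gives the result a fresh 0..n-1 index, so the duplicated rows do not share index labels with the rows they were copied from. Nothing in the loop depends on there being only two classes.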
As written, this makes all group sizes exactly equal. If you only want them approximately equal, you can adjust max_size - len(group), for example by adding some noise to it.
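One possible reading of the noise suggestion is to lower each class's target size by a small random amount, so the classes end up close in size but not identical. The sketch below assumes that interpretation; max_jitter, the seeded generator, and the three-class example frame are hypothetical choices, not something the answer specifies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical three-class frame with 20, 7 and 3 rows per class.
frame = pd.DataFrame({
    "feature1": rng.random(30),
    "class": [0] * 20 + [1] * 7 + [2] * 3,
})

max_size = frame["class"].value_counts().max()
max_jitter = 2  # assumed knob: most rows a class may fall short of max_size

parts = [frame]  # keep every original row
for _, group in frame.groupby("class"):
    # Aim at a target slightly below max_size instead of exactly max_size.
    target = max_size - int(rng.integers(0, max_jitter + 1))
    n_extra = target - len(group)
    if n_extra > 0:
        parts.append(group.sample(n_extra, replace=True, random_state=0))

frame_noisy = pd.concat(parts, ignore_index=True)
sizes = frame_noisy["class"].value_counts()
print(sizes)
```

Each class ends up with between max_size - max_jitter and max_size rows, and the largest class is never reduced because only duplicates are added.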