How to balance a dataset using fit_generator() in Keras?
Question
I am trying to use Keras to fit a CNN model that classifies two classes of data. My dataset is imbalanced and I want to balance it. I don't know whether I can use class_weight with model.fit_generator, and I wonder what would happen if I passed class_weight="balanced" to model.fit_generator.
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        # Select the slice of files between start% and end% of the list.
        from_ = int(len(paths) / 100 * start)
        to_ = int(len(paths) / 100 * end)
        for i in range(from_, to_):
            f = paths[i]
            x = np.load(PathSpectogramFolder + f)
            x = np.expand_dims(x, axis=0)
            # Files whose name contains 'P' belong to the positive class.
            if 'P' in f:
                y = np.repeat([[0, 1]], x.shape[0], axis=0)
            else:
                y = np.repeat([[1, 0]], x.shape[0], axis=0)
            yield (x, y)
history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int((len(filesPath) - int(len(filesPath) / 100 * 25))),
                              validation_steps=int((len(filesPath) - int(len(filesPath) / 100 * 75))),
                              verbose=2,
                              epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
Solution
If you don't want to change your data creation process, you can use class_weight with your fit generator. You set class_weight with a dictionary and then fine-tune the values. For instance, suppose class_weight is not used and you have 50 examples for class 0 and 100 examples for class 1. Then the loss function weights every sample uniformly, so the underrepresented class 0 tends to be neglected. But when you set:
class_weight = {0: 2, 1: 1}
the loss function now gives class 0 twice the weight, so misclassifying the underrepresented class is penalized twice as heavily as before. This way the model can cope with the imbalanced data.
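As a quick sanity check of what this weighting does, here is a minimal NumPy sketch (not Keras code; the labels and predicted probabilities below are made up for illustration): each sample's cross-entropy term is scaled by the weight of its true class, which is exactly the effect of class_weight = {0: 2, 1: 1}.

```python
import numpy as np

class_weight = {0: 2.0, 1: 1.0}

y_true = np.array([0, 0, 1, 1])            # true class indices (hypothetical)
p_class1 = np.array([0.9, 0.1, 0.8, 0.7])  # predicted P(class 1) (hypothetical)

# Per-sample binary cross-entropy.
ce = -np.where(y_true == 1, np.log(p_class1), np.log(1 - p_class1))

# Scale each sample's loss by the weight of its true class.
weights = np.array([class_weight[c] for c in y_true])
weighted_loss = np.mean(weights * ce)
unweighted_loss = np.mean(ce)
```

Because the class-0 samples here are poorly predicted, their doubled weight pulls the average loss up, which is what pushes the optimizer to pay more attention to that class.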
Note that, unlike scikit-learn estimators, Keras expects class_weight to be a dictionary; fit_generator does not accept the string 'balanced'. You can compute equivalent balanced weights yourself (or with sklearn's compute_class_weight). My suggestion is to create a dictionary like class_weight = {0: a1, 1: a2} and try different values for a1 and a2, so you can see the difference.
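For reference, scikit-learn's 'balanced' heuristic weights each class by n_samples / (n_classes * n_samples_in_class). Here is a small sketch, assuming the 50/100 split from the example above, that builds such a dictionary by hand (sklearn.utils.class_weight.compute_class_weight applies the same formula):

```python
import numpy as np

# Hypothetical label array: 50 examples of class 0, 100 of class 1.
labels = np.array([0] * 50 + [1] * 100)

classes, counts = np.unique(labels, return_counts=True)
n = len(labels)

# 'balanced' heuristic: n_samples / (n_classes * count_c)
class_weight = {int(c): n / (len(classes) * cnt)
                for c, cnt in zip(classes, counts)}
# e.g. {0: 1.5, 1: 0.75} -- the minority class gets the larger weight
```

The resulting dictionary is what you would then pass as the class_weight argument of fit_generator.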
Alternatively, you can use undersampling to handle the imbalance instead of class_weight. Check bootstrapping/resampling methods for that purpose.
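A minimal undersampling sketch in that spirit, using hypothetical file names that follow the question's convention (names containing 'P' are the positive class): the majority class is randomly subsampled down to the minority class size before the path list is handed to the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical file lists; names containing 'P' are the positive class,
# as in the question's generator.
pos_paths = [f"P_{i}.npy" for i in range(100)]  # majority class
neg_paths = [f"N_{i}.npy" for i in range(50)]   # minority class

# Undersample the majority class to the minority class size.
n = min(len(pos_paths), len(neg_paths))
pos_sample = list(rng.choice(pos_paths, size=n, replace=False))

# Combine and shuffle before feeding the paths to the generator.
balanced = pos_sample + neg_paths
order = rng.permutation(len(balanced))
balanced = [balanced[i] for i in order]
```

The downside of undersampling is that you discard data from the majority class, so class_weight is usually preferable when the dataset is small.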