How to balance dataset using fit_generator() in Keras?


Problem Description

I am trying to use Keras to fit a CNN model that classifies data into 2 classes. My dataset is imbalanced and I want to balance it. I don't know whether I can use class_weight in model.fit_generator, and I wonder what would happen if I used class_weight="balanced" in model.fit_generator.

Main code:

import numpy as np

def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    # Yields (sample, label) pairs, one spectrogram at a time, for the
    # slice of `paths` between start% and end%.
    while True:
        from_ = int(len(paths) / 100 * start)
        to_ = int(len(paths) / 100 * end)
        for i in range(from_, to_):
            f = paths[i]
            x = np.load(PathSpectogramFolder + f)
            x = np.expand_dims(x, axis=0)

            # Files whose name contains 'P' belong to class 1, the rest to class 0
            # (labels are one-hot encoded).
            if 'P' in f:
                y = np.repeat([[0, 1]], x.shape[0], axis=0)
            else:
                y = np.repeat([[1, 0]], x.shape[0], axis=0)
            yield (x, y)

history = model.fit_generator(
    generate_arrays_for_training(indexPat, filesPath, end=75),                    # first 75% for training
    validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),  # last 25% for validation
    steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
    validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
    verbose=2,
    epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])

Recommended Answer

If you don't want to change your data creation process, you can use class_weight with your fit generator. You can set class_weight with a dictionary and tune the values by observing the results. For instance, suppose class_weight is not used and you have 50 examples for class 0 and 100 examples for class 1. The loss function then weights every example uniformly, which means the underrepresented class 0 will be a problem. But when you set:

class_weight = {0: 2, 1: 1}

the loss function will now give 2 times the weight to your class 0. Misclassifying an example of the underrepresented class therefore incurs twice the penalty it did before, so the model can cope with the imbalanced data.
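For the training call in the question, the dictionary is simply passed through the class_weight argument of fit_generator. A minimal sketch, reusing the generator, filesPath, and callback defined above:

class_weight = {0: 2, 1: 1}   # weight the underrepresented class 0 twice as much

history = model.fit_generator(
    generate_arrays_for_training(indexPat, filesPath, end=75),
    validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
    steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
    validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
    class_weight=class_weight,   # applied to the training loss only
    verbose=2, epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])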

In some libraries (for example scikit-learn) you can pass class_weight='balanced' and the weights are computed automatically, but Keras expects an explicit dictionary. My suggestion is to create a dictionary like class_weight = {0: a1, 1: a2} and try different values for a1 and a2, so you can see the difference.
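If you do want 'balanced'-style weights, one option (an assumption on my part, not part of the original answer) is to compute them with scikit-learn's compute_class_weight and hand the resulting dictionary to Keras. This sketch assumes filesPath and the 'P'-in-filename labelling convention from the question:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed labelling: files with 'P' in the name are class 1, the rest class 0,
# matching the generator in the question.
y_train = np.array([1 if 'P' in f else 0 for f in filesPath])

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_train),
                               y=y_train)
class_weight = dict(enumerate(weights))   # e.g. {0: 1.5, 1: 0.75} for a 50/100 split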

Also, instead of using class_weight, you can undersample the imbalanced data so the classes are balanced at the data level; bootstrap (resampling) methods are a common way to do this.
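As a rough illustration of the undersampling idea (a hypothetical sketch, not from the original answer), the file list itself could be balanced before it is handed to the generator, again assuming the 'P'-in-filename convention:

import random

# Randomly undersample the majority class so both classes contribute equally.
pos_files = [f for f in filesPath if 'P' in f]      # class 1
neg_files = [f for f in filesPath if 'P' not in f]  # class 0

n = min(len(pos_files), len(neg_files))
balanced_files = random.sample(pos_files, n) + random.sample(neg_files, n)
random.shuffle(balanced_files)

# balanced_files can then be passed to generate_arrays_for_training() instead of filesPath.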
