Balancing an imbalanced dataset with Keras image generator

Question

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".

The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting, generated dataset is balanced?

Recommended answer

This would not be a standard approach to deal with unbalanced data. Nor do I think it would be really justified - you would be significantly changing the distributions of your classes, where the smaller class is now much less variable. The larger class would have rich variation, the smaller would be many similar images with small affine transforms. They would live on a much smaller region in image space than the majority class.

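More standard approaches would be:

  • The class_weight argument of model.fit, which you can use to make the model learn more from the minority class.
  • Reducing the size of the majority class.
  • Accepting the imbalance. Deep learning can cope with this, it just needs lots more data (the solution to everything, really).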
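As a rough sketch of the first two options, the snippet below computes per-class weights and randomly undersamples the majority class, using only the standard library. The weighting formula (n_samples / (n_classes * count)) is the common "balanced" heuristic and is an assumption here, not something the answer prescribes. The helper names, file names, and toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_class.values())
    kept = []
    for label, group in by_class.items():
        kept.extend((sample, label) for sample in rng.sample(group, target))
    rng.shuffle(kept)  # avoid returning the data grouped by class
    return [s for s, _ in kept], [l for _, l in kept]


# Hypothetical toy data: 90 images of class 0, 10 images of class 1.
labels = [0] * 90 + [1] * 10
samples = [f"img_{i}.png" for i in range(len(labels))]

# Option 1: a dict like this could be passed as model.fit(..., class_weight=class_weight).
class_weight = balanced_class_weights(labels)

# Option 2: shrink the majority class to the size of the minority class.
small_samples, small_labels = undersample(samples, labels)
```

With these toy labels the minority class gets a weight roughly nine times that of the majority class, and the undersampled set keeps ten samples per class.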

The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).

The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).

If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
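A minimal sketch of that pre-processing idea follows, assuming images are small in-memory arrays. The function oversample_with_augmentation and the hflip stand-in are hypothetical names. In a real Keras pipeline, augment_fn would more likely wrap something like ImageDataGenerator.random_transform applied to NumPy arrays, which is an assumption and not shown here. The caveat above still applies: the added images are only variations of the few minority originals.

```python
import random
from collections import Counter


def hflip(image):
    """Stand-in augmentation: mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def oversample_with_augmentation(images, labels, augment_fn, seed=0):
    """Add augmented copies of under-represented classes until all counts match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_images, out_labels = list(images), list(labels)
    for cls, count in counts.items():
        originals = [img for img, lab in zip(images, labels) if lab == cls]
        # Generate exactly as many augmented images as the class is missing.
        for _ in range(target - count):
            out_images.append(augment_fn(rng.choice(originals)))
            out_labels.append(cls)
    return out_images, out_labels


# Hypothetical toy data: six majority images, two minority images (2x2 pixels each).
images = [[[i, i + 1], [i + 2, i + 3]] for i in range(8)]
labels = [0] * 6 + [1] * 2

balanced_images, balanced_labels = oversample_with_augmentation(images, labels, hflip)
```

The originals are kept unchanged at the front of the returned lists, and the augmented copies are appended after them, so the result can be saved to disk and treated as ordinary training data.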
