Balancing an imbalanced dataset with Keras image generator

Question

The Keras ImageDataGenerator can be used to "generate batches of tensor image data with real-time data augmentation".

The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting generated dataset is balanced?

Recommended answer

This would not be a standard approach to dealing with unbalanced data, and I don't think it would really be justified: you would be significantly changing the distributions of your classes, making the smaller class much less variable. The larger class would have rich variation, while the smaller would consist of many similar images that differ only by small affine transforms. Those images would occupy a much smaller region of image space than the majority class.

More standard approaches would be:

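  • The class_weight argument in model.fit, which you can use to make the model learn more from the minority class.
  • Reducing the size of the majority class.
  • Accepting the imbalance. Deep learning can cope with this; it just needs lots more data (the solution to everything, really).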
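
As a rough illustration of the first two options, here is a minimal sketch in plain Python. It assumes integer class labels and uses the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)), which is one reasonable choice rather than the only one. The helper names, the 900/100 label split, and the file names are hypothetical; only the class_weight argument of model.fit comes from Keras itself.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency ("balanced" heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def undersample_majority(samples, labels, seed=0):
    """Randomly keep only as many items per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_keep = min(len(items) for items in by_class.values())
    kept = []
    for label, items in by_class.items():
        kept.extend((sample, label) for sample in rng.sample(items, n_keep))
    rng.shuffle(kept)
    return kept


# Hypothetical dataset: 900 negatives, 100 positives.
labels = [0] * 900 + [1] * 100
class_weight = balanced_class_weights(labels)
# Option 1, with a compiled Keras model (not built here):
#   model.fit(x_train, y_train, class_weight=class_weight, ...)

# Option 2: shrink the majority class (file names are placeholders).
filenames = [f"img_{i}.png" for i in range(len(labels))]
balanced = undersample_majority(filenames, labels)
```

With these numbers the minority class gets a weight of 5.0 and the majority class about 0.56, so each class contributes roughly equally to the loss. Undersampling instead throws away 800 majority images.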

The first two options are really kind of hacks, and they may harm your ability to cope with real-world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If applying the model to a real-world dataset after training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).

The third option is the right way to go if you have enough data. As an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy on a dataset where positive cases made up between 10% and 30% of the data.

If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class, generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
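
The pre-processing route could look roughly like the sketch below, which assumes the images are already loaded as a NumPy array of shape (n, height, width, channels). The function names and the toy data are hypothetical, and simple_augment is a deliberately crude stand-in (random flip plus a small wrap-around shift). In a Keras workflow you would more likely pass something like ImageDataGenerator.random_transform as the augment callable instead.

```python
import numpy as np


def simple_augment(image, rng):
    """Stand-in augmentation: random horizontal flip plus a small shift."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    dy, dx = rng.integers(-2, 3, size=2)
    # np.roll wraps pixels around the border; fine for a sketch.
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))


def oversample_minority(x, y, minority_label, target_count,
                        augment=simple_augment, seed=0):
    """Append augmented copies of minority images until target_count is reached."""
    rng = np.random.default_rng(seed)
    minority = x[y == minority_label]
    n_extra = target_count - len(minority)
    if n_extra <= 0:
        return x, y
    picks = rng.integers(0, len(minority), size=n_extra)
    extra = np.stack([augment(minority[i], rng) for i in picks])
    x_out = np.concatenate([x, extra])
    y_out = np.concatenate([y, np.full(n_extra, minority_label, dtype=y.dtype)])
    return x_out, y_out


# Toy data: 40 majority and 10 minority images of size 8x8x3.
x = np.random.default_rng(1).random((50, 8, 8, 3))
y = np.array([0] * 40 + [1] * 10)
x_bal, y_bal = oversample_minority(x, y, minority_label=1, target_count=40)
```

The result has equal class counts, but as noted above the 30 added minority images are only variations of the original 10, so the class is no more diverse than before.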
