Classification: What happens if one class has 4 times as much data as the other class?


Question

I am trying to debug an issue with my classifier. The issue is that it always predicts the same class for a given input despite having close to an 80% accuracy.

I trained my CNN to distinguish between 2 classes. Class A has 2575 JPEGs and class B has 665 JPEGs.

Could this have caused my CNN to always predict the same class? Is this too much of an imbalance between the number of items in each class? In general, will my performance improve if I make both classes the same size (665 JPEGs each)?

Recommended answer

The problem seems to be a case of class imbalance and there are different ways to handle it:

  1. Weighted loss: You can reduce the majority class's contribution to the loss function by computing a weighted cross entropy, in which each class's term is scaled by a per-class weight.
  2. Resampling the data: As you mentioned, you can downsample the majority class to balance the classes. You can also upsample the minority class until the classes are even.
  3. Generating augmented data: Since you are handling images, you can upsample the minority class and then apply data augmentation to those images. This addresses the class imbalance, and it also helps reduce overfitting and improve generalisation.
  4. A combination of the above.
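
As a rough illustration of options 1 and 2, here is a minimal sketch in plain Python that uses only the class counts from the question. The inverse-frequency weighting formula and the helper name `oversample_minority` are assumptions chosen for illustration, not something the answer prescribes. The sketch also computes the accuracy of a model that always predicts class A, which comes out at about 79.5% and may explain the "close to 80%" figure in the question.

```python
import random
from collections import Counter

# Class counts taken from the question.
counts = {"A": 2575, "B": 665}
total = sum(counts.values())

# Accuracy of a degenerate model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights (an assumed, commonly used heuristic):
# weight = total / (n_classes * class_count), so the minority class
# receives the larger weight in a weighted cross entropy.
class_weights = {c: total / (len(counts) * n) for c, n in counts.items()}


def oversample_minority(labels, seed=0):
    """Return sample indices in which smaller classes are repeated
    (drawn with replacement) until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(ix) for ix in by_class.values())
    indices = []
    for ix in by_class.values():
        indices.extend(ix)
        indices.extend(rng.choices(ix, k=target - len(ix)))
    rng.shuffle(indices)
    return indices


# Stand-in label list; in practice this would come from the image dataset.
labels = ["A"] * counts["A"] + ["B"] * counts["B"]
balanced = oversample_minority(labels)
balanced_counts = Counter(labels[i] for i in balanced)

print(f"majority baseline accuracy: {majority_baseline:.3f}")
print("class weights:", class_weights)
print("counts after oversampling:", dict(balanced_counts))
```

The resulting weights could then be passed to whatever weighted-loss mechanism the training framework offers, and the oversampled indices could feed an augmentation pipeline so that repeated minority images are not seen as exact copies. With imbalanced data it is also worth checking per-class recall or a confusion matrix rather than accuracy alone.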
