Training with big class imbalance


Question

This is a 3-part question.

1) Class size - I'm training the TF object detection API on 5 classes, whose sizes aren't anywhere close to each other:

  1. No. of images in class 1: 401
  2. No. of images in class 2: 389
  3. No. of images in class 3: 532
  4. No. of images in class 4: 159393
  5. No. of images in class 5: 185313

(346,028 images in total)

This isn't training a typical image classifier, so I'm guessing this isn't really an issue of class imbalance, but I'm wondering if it would affect the outcome model.

2) Can the TF object detection API be used to detect two objects where one is enclosed / bounded by the other?

Ex. face vs person - face is within the bounds of the person

3) This is a continuation where I found that using Faster RCNN means batch_size has to be set to 1.

And because of this, I am not sure if this means I have to wait for the global step during training to match the # of images in the training set (approx. 340k in my custom data set). I am using a Tesla K80 GPU w/ 12 GB memory on Google Compute w/ 4 vCPUs and 15 GB RAM. After about 2 days, I see loss hitting well below 1 though:

INFO:tensorflow:global step 264250: loss = 0.2799 (0.755 sec/step)
INFO:tensorflow:global step 264251: loss = 0.0271 (0.787 sec/step)
INFO:tensorflow:global step 264252: loss = 0.1122 (0.677 sec/step)
INFO:tensorflow:global step 264253: loss = 0.1709 (0.797 sec/step)
INFO:tensorflow:global step 264254: loss = 0.8366 (0.790 sec/step)
INFO:tensorflow:global step 264255: loss = 0.0541 (0.741 sec/step)
INFO:tensorflow:global step 264256: loss = 0.0760 (0.781 sec/step)
INFO:tensorflow:global step 264257: loss = 0.0621 (0.777 sec/step)

How do I determine when to stop? I noticed that even at this point, the frozen inference graph I generate from the latest checkpoint file ONLY seems to detect the class w/ the most images (i.e. face) and doesn't detect anything else.

Answer

1) Yes, it will affect the outcome in some way. More precisely, your model will be very good at recognising class 5 and class 4, and it may have an idea about the others. Consider limiting the number of instances of [4, 5] to be at least in the same order of magnitude as the other classes. This would be useful especially in the beginning, so it makes a balanced representation of each class.
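A hypothetical sketch of that capping step in Python, run before generating your TFRecords. The `annotations` list of `(image_path, class_id)` pairs and the cap value of 1000 are assumptions for illustration, not part of the original question:

```python
import random
from collections import defaultdict

def cap_per_class(annotations, max_per_class=1000, seed=0):
    """Randomly keep at most `max_per_class` images for each class.

    `annotations` is a list of (image_path, class_id) pairs.
    Classes already below the cap are kept in full.
    """
    random.seed(seed)
    by_class = defaultdict(list)
    for image_path, class_id in annotations:
        by_class[class_id].append(image_path)
    capped = []
    for class_id, paths in by_class.items():
        random.shuffle(paths)  # sample uniformly, not just the first files
        for path in paths[:max_per_class]:
            capped.append((path, class_id))
    return capped

# Example with the class sizes from the question (401, 389, 532, 159393, 185313):
sizes = {1: 401, 2: 389, 3: 532, 4: 159393, 5: 185313}
fake = [(f"img_{c}_{i}.jpg", c) for c, n in sizes.items() for i in range(n)]
balanced = cap_per_class(fake, max_per_class=1000)
```

With a cap of 1000, classes 4 and 5 shrink to the same order of magnitude as classes 1-3, which keep all their images.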

Also very important here is to use data augmentation (see this answer).
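In the TF Object Detection API, augmentation goes in the `train_config` block of your `pipeline.config`; a minimal sketch (the full set of available options is defined in the API's `preprocessor.proto`, and which ones make sense depends on your data):

```proto
train_config {
  batch_size: 1
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
}
```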

3) Normally, your model should take several epochs to train well, especially when you have data augmentation.
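As a sanity check on the numbers in the question: with Faster R-CNN and batch_size = 1, each global step processes one image, so epochs = global_step / dataset_size, and the training log above is still short of a single epoch:

```python
# Rough epoch arithmetic, using the class counts from the question.
dataset_size = 401 + 389 + 532 + 159393 + 185313  # = 346028 images
global_step = 264257  # last step shown in the training log
epochs_completed = global_step / dataset_size
print(f"epochs completed: {epochs_completed:.2f}")  # → epochs completed: 0.76
```

So at step ~264k the model hasn't even seen every training image once, let alone trained for several epochs.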

This is written everywhere on SO and in the issues on the repository: you cannot know whether it has converged from the values of the loss alone! Consider this scenario: you have shuffle: True for your input images, and 344,706 images in classes 4 and 5. If the shuffle arranged them so that these images came before those from classes [1, 2, 3], then your model has learnt a good representation so far, but when it encounters an image of class 1 it will overshoot, because of overfitting. So your loss will jump to some very high value.

The solution is to run eval.py in parallel, as that gives you an idea of how the model performs on all classes. And you can stop when you're satisfied with that metric.
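A sketch of what that looks like with the API's legacy TF1 scripts; every path and directory name below is a placeholder for your own setup:

```shell
# Training in one process:
python object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=path/to/pipeline.config \
    --train_dir=path/to/train_dir &

# Evaluation in a second process, pointed at the same checkpoint directory;
# it picks up new checkpoints as training writes them.
python object_detection/legacy/eval.py \
    --logtostderr \
    --pipeline_config_path=path/to/pipeline.config \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir

# Per-class mAP then shows up in TensorBoard, which is what you watch
# to decide when to stop:
tensorboard --logdir=path/to/train_dir
```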

Note that it is normal on StackOverflow to ask separate questions if they address different subjects, because we are answering not only for you but also for all the future people in your position.

So I'll answer 2) in a different one :)
