经过一段时间的训练后,训练损失值正在增加,但是该模型可以很好地检测出物体 [英] Training loss value is increasing after some training time, but the model detects objects pretty good

查看:739
本文介绍了经过一段时间的训练后,训练损失值正在增加,但是该模型可以很好地检测出物体的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在训练CNN从我自己的数据集中检测对象时遇到一个奇怪的问题.我正在使用转移学习,并且在培训开始时,损失值正在降低(如预期的那样).但是一段时间之后,它变得越来越高,我不知道为什么会发生.

I encounter a strange problem while training CNN to detect objects from my own dataset. I am using transfer learning and at the beginning of training, the loss value is decreasing (as expected). But after some time, it gets higher and higher, and I have no idea why it happens.

同时,当我查看Tensorboard上的 Images 标签时,检查CNN预测对象的能力如何,我可以看到它做得很好,看起来并不像现在随着时间的推移变得越来越糟.而且,精确度"和召回率"图表看起来不错,只有损失"图表(尤其是category_loss)显示出随时间增加的趋势.

At the same time, when I look at Images tab on Tensorboard to check how well the CNN predicts objects, I can see that it does it very well, it doesn't look as it is getting worse over time. Also, the Precision and Recall charts look good, only the Loss charts (especially classification_loss) show an increasing trend over time.

以下是一些具体细节:

  • 我有10种不同类别的徽标(例如DHL,宝马,联邦快递等)
  • 每堂课约600张图像
  • 我在Ubuntu 18.04上使用tensorflow-gpu
  • 我尝试了多个预先训练的模型,最新的模型是 faster_rcnn_resnet101_coco ,它具有以下配置管道:

  • I have 10 different classes of logos (such as DHL, BMW, FedEx, etc.)
  • Around 600 images per class
  • I use tensorflow-gpu on Ubuntu 18.04
  • I tried multiple pre-trained models, the latest being faster_rcnn_resnet101_coco with this config pipeline:

model {
  faster_rcnn {
    num_classes: 10
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/models2/faster_rcnn_resnet101_coco/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/train.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/test.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

在这里,您可以看到经过将近23小时的训练并且达到了12万多步的训练结果:

Here you can see results that I get after training for nearly 23 hours and reaching over 120k steps:

  • Loss and Total Loss
  • Precision

所以,我的问题是,为什么损失值会随着时间增加?它应该变得更小或保持大致恒定,但是您可以在上面的图表中清楚地看到上升的趋势. 我认为一切都已正确配置,并且我的数据集相当不错(而且.tfrecord文件也已正确构建").

So, my question is, why is the loss value increasing over time? It should be getting smaller or stay more or less constant, but you can clearly see the increasing trend in the above charts. I think everything is properly configured and my dataset is pretty decent (also .tfrecord files were correctly "built").

要检查是否是我的错,我尝试使用其他用户的数据集和配置文件.因此,我使用了浣熊数据集作者的文件(他在他的仓库中提供了所有必要的文件) .我只是下载了它们,并开始进行了未经修改的培训,以检查我是否会得到与他相似的结果.

To check if it is my fault I tried to use somebody's else dataset and configuration files. So I used the racoon dataset author's files (he provided all of the necessary files on his repo). I just downloaded them and started training with no modifications to check if I would get similar results as him.

令人惊讶的是,经过82k步长之后,我得到的图表与链接文章中显示的图表完全不同(在22k步长之后捕获的图​​表).在这里,您可以看到我们的结果比较:

Surprisingly, after 82k steps, I got entirely different charts than the ones shown in the linked article (that were captured after 22k steps). Here you can see the comparison of our results:

  • My losses vs his TotalLoss
  • My precision vs his mAP

很明显,我的PC上的工作方式有所不同.我怀疑这可能是我在自己的数据集上损失越来越大的同一原因,这就是我提到它的原因.

Clearly, something worked differently on my PC. I suspect it may be the same reason why I get increasing loss on my own dataset, that's why I mentioned it.

推荐答案

totalLoss是其他四个损失的加权和. (RPN cla和reg损失,BoxCla cla和reg损失),它们都是评估损失.在tensorboard上,您可以选中或取消选中以查看评估结果(仅用于训练或仅用于评估). (例如,下面的图片有火车摘要和评估摘要)

The totalLoss is the weighted sum of those four other losses. (RPN cla and reg losses, BoxCla cla and reg losses) and they are all Evaluation loss. On tensorboard you can check or uncheck to see the evaluation results for training only or for evaluation only. (For example, the following pic has train summary and evaluation summary)

如果评估损失在增加,则可能表明模型过拟合,此外,精度指标略有下降.

If the evaluation loss is increasing, this might suggest an overfitting model, besides, the precision metrics dropped a little bit.

要尝试更好的微调结果,可以尝试调整四个损失的权重,例如,可以增加BoxClassifierLoss/classification_loss的权重,以使模型更好地关注此指标.在您的配置文件中,second_stage_classification_loss_weightfirst_stage_objectness_loss_weight的损失权重均为1,而其他两个均为2,因此该模型当前将重点放在其他两个方面.

To try a better fine-tuning result, you may try adjusting the weights of the four losses, for example, you may increase the weight for BoxClassifierLoss/classification_loss to let the model focused on this metric better. In your config file, the loss weight for second_stage_classification_loss_weight and first_stage_objectness_loss_weight are both 1 while the other two are both 2, so the model currently focused on the other two a little more.

关于为什么loss_1loss_2相同的一个额外问题.这可以通过查看张量流图来解释.

An extra question about why loss_1 and loss_2 are the same. This can be explained by looking at the tensorflow graph.

此处loss_2total_loss的摘要,(请注意,total_loss与totalLoss中的不相同),红色圆圈的节点是tf.identity节点.该节点将输出与输入相同的张量,因此loss_1loss_2

Here loss_2 is the summary for total_loss, (note this total_loss is not the same as in totalLoss) and the red-circled node is a tf.identity node. This node will output the same tensor as the input, so loss_1 is the same as loss_2

这篇关于经过一段时间的训练后,训练损失值正在增加,但是该模型可以很好地检测出物体的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆