Training loss is increasing after some training time, but the model detects objects pretty well

Views: 739  Posted: 2020/5/4 6:00:46  Tags: tensorflow object-detection loss-function

Problem description

I have run into a strange problem while training a CNN to detect objects in my own dataset. I am using transfer learning, and at the beginning of training the loss value decreases, as expected. After some time, however, it starts climbing steadily, and I have no idea why.

At the same time, when I look at the Images tab in TensorBoard to check how well the CNN predicts objects, I can see that it does very well, and it does not appear to get worse over time. The Precision and Recall charts also look good; only the Loss charts (especially classification_loss) show an increasing trend over time.

Here are some specifics:

I have 10 different classes of logos (such as DHL, BMW, FedEx, etc.)
Around 600 images per class
I use tensorflow-gpu on Ubuntu 18.04
I tried multiple pre-trained models, the latest being faster_rcnn_resnet101_coco with this pipeline config:

model {
  faster_rcnn {
    num_classes: 10
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/models2/faster_rcnn_resnet101_coco/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/train.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/test.record"
  }
  label_map_path: "/home/franciszek/Pobrane/models-master/research/object_detection/logo_detection/data2/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
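
For reference, the manual_step_learning_rate block in the config above can be read as a piecewise-constant schedule. The following is a minimal sketch in plain Python, not the Object Detection API's own code: the name learning_rate_at is made up for illustration, and it assumes that each scheduled rate takes effect once the global step reaches its boundary. Under that reading, any step below 900000 (for example step 120000) still uses the initial rate of 0.0003.

```python
from bisect import bisect_right

# Values copied from train_config.optimizer in the pipeline config above.
INITIAL_LEARNING_RATE = 0.0003
SCHEDULE = [(900_000, 0.00003), (1_200_000, 0.000003)]  # (step, learning_rate)


def learning_rate_at(step):
    """Return the piecewise-constant learning rate for a global step.

    Assumption: a scheduled rate applies from its boundary step onward.
    This helper is hypothetical and only mirrors the config values.
    """
    boundaries = [boundary for boundary, _ in SCHEDULE]
    rates = [INITIAL_LEARNING_RATE] + [rate for _, rate in SCHEDULE]
    # bisect_right counts how many boundaries are <= step.
    return rates[bisect_right(boundaries, step)]


if __name__ == "__main__":
    for step in (0, 120_000, 900_000, 1_200_000):
        print(step, learning_rate_at(step))
```

If this reading is right, the two learning-rate drops are never reached in a run that stops well before 900k steps, which may be worth keeping in mind when comparing against schedules tuned for much longer training.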

Here you can see the results I get after training for nearly 23 hours and reaching over 120k steps:

[Charts: Loss and Total Loss; Precision]
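
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes that all of the roughly 6000 images (10 classes times about 600) are in the training set, which the post does not state, and it rounds "over 120k steps" and "nearly 23 hours" to 120000 and 23. The helper names are hypothetical.

```python
# Figures taken from the post; all of them are approximate.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 600   # "around 600 images per class"
BATCH_SIZE = 1           # batch_size in train_config
STEPS = 120_000          # "over 120k steps"
HOURS = 23               # "nearly 23 hours"


def passes_over_data(steps, batch_size, num_images):
    """Number of times the training set has been seen (epochs)."""
    return steps * batch_size / num_images


def steps_per_second(steps, hours):
    """Average training throughput over the whole run."""
    return steps / (hours * 3600)


if __name__ == "__main__":
    num_images = NUM_CLASSES * IMAGES_PER_CLASS
    print("approx. epochs:", passes_over_data(STEPS, BATCH_SIZE, num_images))
    print("approx. steps/s:", round(steps_per_second(STEPS, HOURS), 2))
```

Under these assumptions the model has seen each image about 20 times, at roughly 1.45 steps per second. If part of the data is held out for testing, the number of passes over the training set is correspondingly higher.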

So, my question is: why is the loss value increasing over time? It should get smaller or stay more or less constant, but you can clearly see the increasing trend in the charts above. I think everything is properly configured and my dataset is pretty decent (the .tfrecord files were also "built" correctly).
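
One way to make "increasing trend" concrete is to smooth the logged loss and fit a straight line to it. With batch_size: 1, each logged value comes from a single image, so the raw curve can be very noisy. The sketch below uses a plain exponential moving average and an ordinary least-squares slope. The function names are hypothetical, the sample values are invented purely to show the calls, and the real loss values would have to be exported from TensorBoard first.

```python
def ema(values, weight=0.9):
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed = []
    last = values[0]
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


if __name__ == "__main__":
    # Invented numbers, not measurements from the run described above.
    sample_losses = [0.9, 0.5, 0.7, 0.4, 0.6, 0.5, 0.8, 0.6, 0.9, 0.7]
    trend = slope(ema(sample_losses))
    print("rising" if trend > 0 else "flat or falling", trend)
```

A positive slope on the smoothed series over a long window would support the reading of the charts; a slope near zero would suggest that the apparent rise is mostly noise.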

To check whether it is my fault, I tried somebody else's dataset and configuration files. I used the files from the author of the raccoon dataset (he provides all of the necessary files in his repo). I just downloaded them and started training with no modifications to check whether I would get results similar to his.

Surprisingly, after 82k steps, I got entirely different charts from the ones shown in the linked article (which were captured after 22k steps). Here you can see the comparison of our results:

[Charts: my losses vs. his TotalLoss; my precision vs. his mAP]

Clearly, something works differently on my PC. I suspect it may be the same reason I get increasing loss on my own dataset, which is why I mention it.
