Which parameters of Mask-RCNN control mask recall?

Question

I'm interested in fine-tuning a Mask-RCNN model that I'm using for instance segmentation. So far I have trained the model for 6 epochs, and the various Mask-RCNN losses are as follows:

The reason I'm stopping is that the COCO evaluation metrics seem to have dipped in the last epoch:

I know this is a far-reaching question, but I'm looking to gain some intuition about which parameters will have the most impact on the evaluation metrics. I understand there are three places to consider:

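  1. Should I be looking at the batch size, learning rate and momentum? I'm using an SGD optimizer with a learning rate of 1e-4 and a batch size of 2.
  2. Should I use more training data or add augmentation (I'm currently not using any)? My dataset is already fairly large, at 40K images.
  3. Should I be looking at specific MaskRCNN parameters?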

I think I'll likely be asked to be more specific about what I want to improve, so let me say that I would like to improve the recall of the individual masks. The model is performing well but doesn't quite capture the full extent of what I would like it to. I'm also leaving out the details of the specific learning problem, as I'd like to gain intuition about how to approach this in general.

Recommended answer

A few notes:

  • 6 epochs is far too few for the network to converge, even with a pre-trained network, especially one as big as resnet50. I think you need at least 50 epochs. On a pre-trained resnet18 I started to get good results after 30 epochs, resnet34 needed 10-20 more, and your resnet50 with a 40k-image training set will definitely need more than 6;
  • definitely use a pre-trained network;
  • in my experience, I failed to get results I liked with SGD, so I started using AdamW with a ReduceLROnPlateau scheduler. The network converges quite fast at first, reaching 50-60% AP by epoch 7-8, but it only climbs to 80-85% after 50-60 epochs, through very small improvements from epoch to epoch, and only if the LR is small enough. You must be familiar with the notion of gradient descent. I think of it this way: the more augmentation you have, the more your "hill" is covered with "boulders" that you have to be able to get around, and that is only possible if you control the LR. Additionally, AdamW helps with overfitting.

This is how I do it. For networks with a higher input resolution (your input images are scaled on input by the net itself), I use a higher lr.

import torch

# params, epochs, train_loader, test_loader, scaler, device and args, as well as
# train_one_epoch, evaluate and save_and_print_size_of_model, are defined elsewhere.
init_lr = 0.00005
weight_decay = init_lr * 100
optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=weight_decay)
# Multiply the LR by 0.75 when the monitored loss stops improving (patience=3).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True, patience=3, factor=0.75)

for epoch in range(epochs):
    # train for one epoch, printing every 10 iterations
    metric_logger = train_one_epoch(model, optimizer, train_loader, scaler, device,
                                    epoch, print_freq=10)

    # let the scheduler react to the average training loss of this epoch
    scheduler.step(metric_logger.loss.global_avg)
    # keep weight decay tied to the current LR (100x) in every param group
    for group in optimizer.param_groups:
        group["weight_decay"] = group["lr"] * 100

    # evaluate on the test dataset
    evaluate(model, test_loader, device=device)

    print("[INFO] serializing model to '{}' ...".format(args["model"]))
    save_and_print_size_of_model(model, args["model"], script=False)

Choose the lr and weight decay so that training drives the lr down to a very small value, such as 1/10 of the initial lr, by the end of training. If you hit a plateau too often, the scheduler quickly brings the lr to very small values and the network learns nothing for the rest of the epochs.
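As a rough back-of-the-envelope check of that 1/10 target, the sketch below counts how many reductions by factor=0.75 it takes to get the lr below one tenth of its starting value. The helper names are hypothetical. The epoch estimate assumes the scheduler only reduces once the number of non-improving epochs exceeds patience, with no cooldown, which is my reading of ReduceLROnPlateau and worth checking against your PyTorch version.

```python
import math


def reductions_to_reach(target_fraction, factor):
    """Smallest n such that factor ** n <= target_fraction."""
    return math.ceil(math.log(target_fraction) / math.log(factor))


def min_stalled_epochs(reductions, patience):
    """Lower bound on non-improving epochs needed to trigger that many reductions
    (assumes a reduction fires on the (patience + 1)-th bad epoch, no cooldown)."""
    return reductions * (patience + 1)


n = reductions_to_reach(0.1, 0.75)
print("reductions needed:", n)
print("stalled epochs needed (at least):", min_stalled_epochs(n, patience=3))
```

With factor=0.75 and patience=3 this comes to 9 reductions and at least 36 stalled epochs, which fits the 50-60 epoch training length mentioned above better than a 6-epoch run.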

Your plots indicate that your LR is too high at some point in training: the network stops learning and then AP goes down. You need constant improvement, even if small. The longer the network trains, the more subtle the details it learns about your domain, and the smaller the learning rate needs to be. IMHO, a constant LR will not let it do that properly.
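To make the contrast with a constant LR concrete, here is a simplified pure-Python imitation of a reduce-on-plateau rule applied to a made-up loss history. It is only a sketch: the function name and the loss values are hypothetical, and the real torch.optim.lr_scheduler.ReduceLROnPlateau has more options (cooldown, min_lr, absolute thresholds) that are left out here.

```python
def plateau_schedule(losses, init_lr, factor=0.75, patience=3, threshold=1e-4):
    """Return the LR in effect after each epoch for a given loss history.

    Simplified rule: an epoch counts as "bad" unless the loss beats the best
    loss so far by a relative margin; once more than `patience` bad epochs
    accumulate, the LR is multiplied by `factor` and the counter resets.
    """
    lr = init_lr
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best * (1.0 - threshold):
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up losses: quick progress, a plateau, then progress after the LR drops.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.65]
for epoch, lr in enumerate(plateau_schedule(losses, init_lr=5e-5)):
    print(epoch, lr)
```

In this toy history the LR stays at its initial value through the early progress and is cut only on the fourth consecutive epoch without improvement.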

  • anchor generator settings. Here is how I initialize the network.

import torchvision
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.rpn import AnchorGenerator


def get_maskrcnn_resnet_model(name, num_classes, pretrained, res='normal'):
    print('Using maskrcnn with {} backbone...'.format(name))

    # ResNet + FPN backbone with all 5 layer groups trainable.
    # Newer torchvision releases prefer backbone_name=... and weights=... here.
    backbone = resnet_fpn_backbone(name, pretrained=pretrained, trainable_layers=5)

    # One anchor size per FPN level, kept small to pick up small objects.
    sizes = ((4,), (8,), (16,), (32,), (64,))
    aspect_ratios = ((0.25, 0.5, 1.0, 2.0, 4.0),) * len(sizes)
    anchor_generator = AnchorGenerator(
        sizes=sizes, aspect_ratios=aspect_ratios
    )

    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
                                                    output_size=7, sampling_ratio=2)

    default_min_size = 800
    default_max_size = 1333

    # Scale the resolution the model resizes its inputs to.
    if res == 'low':
        min_size = int(default_min_size / 1.25)
        max_size = int(default_max_size / 1.25)
    elif res == 'normal':
        min_size = default_min_size
        max_size = default_max_size
    elif res == 'high':
        min_size = int(default_min_size * 1.25)
        max_size = int(default_max_size * 1.25)
    else:
        raise ValueError('Invalid res={} param'.format(res))

    model = MaskRCNN(backbone, min_size=min_size, max_size=max_size, num_classes=num_classes,
                     rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)

    # Allow up to 512 detections per image.
    model.roi_heads.detections_per_img = 512
    return model

I need to find small objects here, which is why I use these anchor parameters.
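To get a feel for what those sizes and aspect ratios mean in pixels, the sketch below lists the approximate anchor widths and heights they produce. It assumes the aspect ratio is height/width and that the anchor area stays close to size squared, which is my understanding of how torchvision's AnchorGenerator builds its base anchors. The helper anchor_shapes is hypothetical and not part of torchvision.

```python
import math


def anchor_shapes(size, aspect_ratios):
    """Approximate (width, height) of each anchor for one size.

    Assumes ratio = height / width and area ~= size ** 2.
    """
    shapes = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        shapes.append((round(w_ratio * size, 2), round(h_ratio * size, 2)))
    return shapes


sizes = (4, 8, 16, 32, 64)
aspect_ratios = (0.25, 0.5, 1.0, 2.0, 4.0)
for size in sizes:
    print(size, anchor_shapes(size, aspect_ratios))
```

Under these assumptions the smallest level proposes boxes only a few pixels across (in the resized input image), which is what makes very small objects reachable. If your objects are larger, sizes closer to the torchvision defaults are probably a better starting point.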

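  • class balance. If you only have your object and background, there is no problem. If you have more classes, make sure that your split (80% for training and 20% for testing) is applied more or less exactly to every class used in your particular training.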
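One way to check that split per class is to count how many instances of each class land in the training set versus the test set. The sketch below is only an illustration: the function name and the label lists are made up, and it assumes you can extract a list of class labels for every image.

```python
from collections import Counter


def train_fraction_per_class(train_labels, test_labels):
    """Fraction of each class's instances that ended up in the training set.

    Both arguments are lists with one list of class labels per image.
    """
    train_counts = Counter(label for image in train_labels for label in image)
    test_counts = Counter(label for image in test_labels for label in image)
    classes = sorted(set(train_counts) | set(test_counts))
    return {
        c: train_counts[c] / (train_counts[c] + test_counts[c])
        for c in classes
    }


# Hypothetical labels: one inner list per image.
train_labels = [["cat", "dog"], ["cat"], ["cat", "cat"], ["dog"]]
test_labels = [["cat"], ["dog", "dog"]]
print(train_fraction_per_class(train_labels, test_labels))
```

A class whose fraction is far from 0.8 (here "dog", at 0.5) is over- or under-represented in the test set, so the split is worth redoing, for example by stratifying on the rarest class present in each image.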

Good luck!
