TensorFlow - ValueError: Checkpoint version should be V2


Problem description

  • GPU: NVIDIA GeForce RTX 2060
  • CPU: 16 GB RAM, 6 processor cores
  • TensorFlow: 2.3.1
  • Python: 3.8.6
  • CUDA: 10.1
  • cuDNN: 7.6


I am training a Mask R-CNN Inception ResNet V2 1024x1024 model (on my computer's GPU), downloaded from the TensorFlow 2 Detection Model Zoo. I am training it on my custom dataset, which I labeled using Label-img. When I train the model using the Anaconda command python model_main_tf2.py --model_dir=models/my_faster_rcnn --pipeline_config_path=models/my_faster_rcnn/pipeline.config, I get the following error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\object_detection\model_lib_v2.py", line 564, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\object_detection\model_lib_v2.py", line 348, in load_fine_tune_checkpoint
    raise ValueError('Checkpoint version should be V2')
ValueError: Checkpoint version should be V2

What is the code needed to resolve this error? (Below are some scripts referenced in the error):

model_main_tf2.py:

# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

r"""Creates and runs TF2 object detection models.

For local training/evaluation run:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
  --model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
  --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
  --pipeline_config_path=$PIPELINE_CONFIG_PATH \
  --alsologtostderr
"""
from absl import flags
import tensorflow.compat.v2 as tf
from object_detection import model_lib_v2

flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                    'file.')
flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
flags.DEFINE_bool('eval_on_train_data', False, 'Enable evaluating on train '
                  'data (only supported in distributed training).')
flags.DEFINE_integer('sample_1_of_n_eval_examples', None, 'Will sample one of '
                     'every n eval input examples, where n is provided.')
flags.DEFINE_integer('sample_1_of_n_eval_on_train_examples', 5, 'Will sample '
                     'one of every n train input examples for evaluation, '
                     'where n is provided. This is only used if '
                     '`eval_training_data` is True.')
flags.DEFINE_string(
    'model_dir', None, 'Path to output model directory '
                       'where event and checkpoint files will be written.')
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint.  If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')

flags.DEFINE_integer('eval_timeout', 3600, 'Number of seconds to wait for an '
                     'evaluation checkpoint before exiting.')

flags.DEFINE_bool('use_tpu', False, 'Whether the job is executing on a TPU.')
flags.DEFINE_string(
    'tpu_name',
    default=None,
    help='Name of the Cloud TPU for Cluster Resolvers.')
flags.DEFINE_integer(
    'num_workers', 1, 'When num_workers > 1, training uses '
    'MultiWorkerMirroredStrategy. When num_workers = 1 it uses '
    'MirroredStrategy.')
flags.DEFINE_integer(
    'checkpoint_every_n', 1000, 'Integer defining how often we checkpoint.')
flags.DEFINE_boolean('record_summaries', True,
                     ('Whether or not to record summaries during'
                      ' training.'))

FLAGS = flags.FLAGS


def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  tf.config.set_soft_device_placement(True)

  if FLAGS.checkpoint_dir:
    model_lib_v2.eval_continuously(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        train_steps=FLAGS.num_train_steps,
        sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
        sample_1_of_n_eval_on_train_examples=(
            FLAGS.sample_1_of_n_eval_on_train_examples),
        checkpoint_dir=FLAGS.checkpoint_dir,
        wait_interval=300, timeout=FLAGS.eval_timeout)
  else:
    if FLAGS.use_tpu:
      # TPU is automatically inferred if tpu_name is None and
      # we are running under cloud ai-platform.
      resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
          FLAGS.tpu_name)
      tf.config.experimental_connect_to_cluster(resolver)
      tf.tpu.experimental.initialize_tpu_system(resolver)
      strategy = tf.distribute.experimental.TPUStrategy(resolver)
    elif FLAGS.num_workers > 1:
      strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    else:
      strategy = tf.compat.v2.distribute.MirroredStrategy()

    with strategy.scope():
      model_lib_v2.train_loop(
          pipeline_config_path=FLAGS.pipeline_config_path,
          model_dir=FLAGS.model_dir,
          train_steps=FLAGS.num_train_steps,
          use_tpu=FLAGS.use_tpu,
          checkpoint_every_n=FLAGS.checkpoint_every_n,
          record_summaries=FLAGS.record_summaries)

if __name__ == '__main__':
  tf.compat.v1.app.run()

pipeline.config file:

# Mask R-CNN with Inception Resnet v2 (no atrous)
# Sync-trained on COCO (with 8 GPUs) with batch size 16 (1024x1024 resolution)
# Initialized from Imagenet classification checkpoint
# TF2-Compatible, *Not* TPU-Compatible
#
# Achieves XXX mAP on COCO

model {
  faster_rcnn {
    number_of_stages: 3
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 1024
        width: 1024
        # pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2_keras'
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        mask_height: 33
        mask_width: 33
        mask_prediction_conv_depth: 0
        mask_prediction_num_conv_layers: 4
        conv_hyperparams {
          op: CONV
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.01
            }
          }
        }
        predict_instance_masks: true
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_mask_prediction_loss_weight: 4.0
    resize_masks: false
  }
}

train_config: {
  batch_size: 1
  num_steps: 200000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.008
          total_steps: 200000
          warmup_learning_rate: 0.0
          warmup_steps: 5000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "pre-trained-models/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/checkpoint/ckpt-0"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  metrics_set: "coco_mask_metrics"
  eval_instance_masks: true
  use_moving_averages: false
  batch_size: 1
  include_metrics_per_category: true
}

eval_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}

The rest of the python scripts referenced in the error can be found here, as they would not fit in a single StackOverflow post.

Solution

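Your train_config block is probably missing fine_tune_checkpoint_version: V2. The traceback shows load_fine_tune_checkpoint in model_lib_v2.py rejecting the checkpoint version, and the field appears to default to V1 when it is omitted. Add that line next to fine_tune_checkpoint, or base your custom modifications on the reference config below: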

https://github.com/tensorflow/models/blob/6d6a78a259d4929b7f00d97aa5bbee7588463abd/research/object_detection/configs/tf2/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8.config#L124
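
If you would rather patch an existing pipeline.config than copy the reference file, the edit can be scripted. What follows is only a minimal sketch, assuming the config is plain protobuf text like the one in the question. The helper name ensure_checkpoint_version_v2 and the SAMPLE string are made up for illustration and are not part of the Object Detection API. It uses only the standard library and works on the text directly, so it does not validate the config against the real schema.

```python
import re

# Hypothetical sample: a trimmed train_config like the one in the question.
SAMPLE = '''train_config: {
  batch_size: 1
  fine_tune_checkpoint: "pre-trained-models/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/checkpoint/ckpt-0"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
'''

# Matches an existing "fine_tune_checkpoint_version: <value>" line.
_VERSION_RE = re.compile(
    r"^([ \t]*)fine_tune_checkpoint_version[ \t]*:[ \t]*\w+", re.MULTILINE)

# Matches the "fine_tune_checkpoint: ..." line only. The "_version" and
# "_type" fields do not match because a colon must follow the name.
_CHECKPOINT_RE = re.compile(
    r"^([ \t]*)fine_tune_checkpoint[ \t]*:.*$", re.MULTILINE)


def ensure_checkpoint_version_v2(config_text: str) -> str:
    """Return config_text with fine_tune_checkpoint_version set to V2."""
    # If the field is already present, force its value to V2.
    if _VERSION_RE.search(config_text):
        return _VERSION_RE.sub(
            r"\g<1>fine_tune_checkpoint_version: V2", config_text)

    # Otherwise add it right after fine_tune_checkpoint, reusing its indent.
    match = _CHECKPOINT_RE.search(config_text)
    if match is None:
        raise ValueError("no fine_tune_checkpoint line found in config text")
    new_line = "\n" + match.group(1) + "fine_tune_checkpoint_version: V2"
    return config_text[:match.end()] + new_line + config_text[match.end():]


if __name__ == "__main__":
    print(ensure_checkpoint_version_v2(SAMPLE))
```

To apply it to a real file, read models/my_faster_rcnn/pipeline.config as text, pass the contents through the helper, and write the result back (keep a backup first). Field order inside train_config does not matter in protobuf text format, so placing the new line directly under fine_tune_checkpoint is only a readability choice.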
