TensorFlow - ValueError: Checkpoint version should be V2
Problem description
- GPU: NVIDIA GeForce RTX 2060
- CPU: 16GB RAM, 6 processor cores
- TensorFlow: 2.3.1
- Python: 3.8.6
- CUDA: 10.1
- cuDNN: 7.6
I am training a Mask R-CNN Inception ResNet V2 1024x1024 model (on my computer's GPU), as downloaded from the TensorFlow 2 Detection Model Zoo. I am training it on my custom dataset, which I have labeled using Label-img. When I train the model using the Anaconda command
python model_main_tf2.py --model_dir=models/my_faster_rcnn --pipeline_config_path=models/my_faster_rcnn/pipeline.config
training fails with ValueError: Checkpoint version should be V2, raised from load_fine_tune_checkpoint in model_lib_v2.py (line 348). The full traceback is reproduced further down.
What is the code needed to resolve this error? Two of the files referenced in the error, model_main_tf2.py and pipeline.config, are listed in full below, after the traceback. The rest of the Python scripts referenced in the error can be found here, as they would not fit in a single StackOverflow post.
You may be missing fine_tune_checkpoint_version: V2 in train_config{}. Try adding that line to the config, keeping your custom modifications.
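The suggested fix (adding fine_tune_checkpoint_version: V2 inside train_config) can be made by hand in a text editor. For anyone who prefers to script it, here is a minimal sketch that patches the raw config text with the standard library only. The helper name ensure_checkpoint_version is hypothetical, and the code assumes the layout used in this question: a single fine_tune_checkpoint: "..." line inside train_config.

```python
import re


def ensure_checkpoint_version(config_text: str, version: str = "V2") -> str:
    """Return config_text with fine_tune_checkpoint_version set to `version`.

    Hypothetical helper (not part of the Object Detection API). It edits the
    raw text and assumes one `fine_tune_checkpoint: "..."` line is present.
    """
    version_line = r"^(\s*fine_tune_checkpoint_version\s*:\s*)\S+"
    if re.search(version_line, config_text, flags=re.MULTILINE):
        # The field already exists: overwrite its value in place.
        return re.sub(
            version_line,
            lambda m: m.group(1) + version,
            config_text,
            flags=re.MULTILINE,
        )

    # Otherwise insert the field right after the fine_tune_checkpoint line,
    # reusing that line's indentation. The pattern requires a colon directly
    # after the name, so fine_tune_checkpoint_type is not matched by mistake.
    checkpoint_line = r"^(?P<indent>[ \t]*)fine_tune_checkpoint\s*:.*$"
    match = re.search(checkpoint_line, config_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no fine_tune_checkpoint line found in config")
    insertion = f"\n{match.group('indent')}fine_tune_checkpoint_version: {version}"
    return config_text[: match.end()] + insertion + config_text[match.end():]


# Small demonstration on a trimmed-down train_config block.
sample = '''train_config: {
  batch_size: 1
  fine_tune_checkpoint: "pre-trained-models/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/checkpoint/ckpt-0"
}
'''
patched = ensure_checkpoint_version(sample)
print(patched)
```

To apply it to a real file, read pipeline.config as text, pass the contents through the function, and write the result back (keeping a backup copy first is a sensible precaution).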
Training command:
python model_main_tf2.py --model_dir=models/my_faster_rcnn --pipeline_config_path=models/my_faster_rcnn/pipeline.config
Full traceback:
Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\user\anaconda3\envs\object_detection_api\lib\site-packages\object_detection\model_lib_v2.py", line 348, in load_fine_tune_checkpoint
    raise ValueError('Checkpoint version should be V2')
ValueError: Checkpoint version should be V2
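The traceback ends inside load_fine_tune_checkpoint, so the run is rejected while the fine-tune settings are being checked. The sketch below is a simplified stand-in for that kind of guard, not the library's source. The enum, the function name check_checkpoint_version, and in particular the assumption that a field left out of train_config behaves like V1 are illustrative. That assumption would explain why a config with no fine_tune_checkpoint_version line hits this error.

```python
from enum import Enum
from typing import Optional


class CheckpointVersion(Enum):
    """Illustrative stand-in for the checkpoint version setting."""
    V1 = 1
    V2 = 2


def check_checkpoint_version(
    configured: Optional[CheckpointVersion],
) -> CheckpointVersion:
    """Reject anything other than V2, mimicking the error in the traceback."""
    # Assumption: an unset field falls back to V1.
    effective = configured if configured is not None else CheckpointVersion.V1
    if effective is CheckpointVersion.V1:
        raise ValueError('Checkpoint version should be V2')
    return effective


# A config without the field (None here) fails; an explicit V2 passes.
for setting in (None, CheckpointVersion.V2):
    try:
        print(setting, '->', check_checkpoint_version(setting).name)
    except ValueError as err:
        print(setting, '->', 'ValueError:', err)
```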
model_main_tf2.py:
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
r"""Creates and runs TF2 object detection models.

For local training/evaluation run:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
  --model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
  --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
  --pipeline_config_path=$PIPELINE_CONFIG_PATH \
  --alsologtostderr
"""
from absl import flags
import tensorflow.compat.v2 as tf
from object_detection import model_lib_v2

flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                    'file.')
flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
flags.DEFINE_bool('eval_on_train_data', False, 'Enable evaluating on train '
                  'data (only supported in distributed training).')
flags.DEFINE_integer('sample_1_of_n_eval_examples', None, 'Will sample one of '
                     'every n eval input examples, where n is provided.')
flags.DEFINE_integer('sample_1_of_n_eval_on_train_examples', 5, 'Will sample '
                     'one of every n train input examples for evaluation, '
                     'where n is provided. This is only used if '
                     '`eval_training_data` is True.')
flags.DEFINE_string(
    'model_dir', None, 'Path to output model directory '
                       'where event and checkpoint files will be written.')
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint.  If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
flags.DEFINE_integer('eval_timeout', 3600, 'Number of seconds to wait for an'
                     'evaluation checkpoint before exiting.')
flags.DEFINE_bool('use_tpu', False, 'Whether the job is executing on a TPU.')
flags.DEFINE_string(
    'tpu_name',
    default=None,
    help='Name of the Cloud TPU for Cluster Resolvers.')
flags.DEFINE_integer(
    'num_workers', 1, 'When num_workers > 1, training uses '
    'MultiWorkerMirroredStrategy. When num_workers = 1 it uses '
    'MirroredStrategy.')
flags.DEFINE_integer(
    'checkpoint_every_n', 1000, 'Integer defining how often we checkpoint.')
flags.DEFINE_boolean('record_summaries', True,
                     ('Whether or not to record summaries during'
                      ' training.'))

FLAGS = flags.FLAGS


def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  tf.config.set_soft_device_placement(True)

  if FLAGS.checkpoint_dir:
    model_lib_v2.eval_continuously(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        train_steps=FLAGS.num_train_steps,
        sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
        sample_1_of_n_eval_on_train_examples=(
            FLAGS.sample_1_of_n_eval_on_train_examples),
        checkpoint_dir=FLAGS.checkpoint_dir,
        wait_interval=300, timeout=FLAGS.eval_timeout)
  else:
    if FLAGS.use_tpu:
      # TPU is automatically inferred if tpu_name is None and
      # we are running under cloud ai-platform.
      resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
          FLAGS.tpu_name)
      tf.config.experimental_connect_to_cluster(resolver)
      tf.tpu.experimental.initialize_tpu_system(resolver)
      strategy = tf.distribute.experimental.TPUStrategy(resolver)
    elif FLAGS.num_workers > 1:
      strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    else:
      strategy = tf.compat.v2.distribute.MirroredStrategy()

    with strategy.scope():
      model_lib_v2.train_loop(
          pipeline_config_path=FLAGS.pipeline_config_path,
          model_dir=FLAGS.model_dir,
          train_steps=FLAGS.num_train_steps,
          use_tpu=FLAGS.use_tpu,
          checkpoint_every_n=FLAGS.checkpoint_every_n,
          record_summaries=FLAGS.record_summaries)


if __name__ == '__main__':
  tf.compat.v1.app.run()
pipeline.config file:
# Mask R-CNN with Inception Resnet v2 (no atrous)
# Sync-trained on COCO (with 8 GPUs) with batch size 16 (1024x1024 resolution)
# Initialized from Imagenet classification checkpoint
# TF2-Compatible, *Not* TPU-Compatible
#
# Achieves XXX mAP on COCO
model {
  faster_rcnn {
    number_of_stages: 3
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 1024
        width: 1024
        # pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2_keras'
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        mask_height: 33
        mask_width: 33
        mask_prediction_conv_depth: 0
        mask_prediction_num_conv_layers: 4
        conv_hyperparams {
          op: CONV
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.01
            }
          }
        }
        predict_instance_masks: true
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_mask_prediction_loss_weight: 4.0
    resize_masks: false
  }
}
train_config: {
  batch_size: 1
  num_steps: 200000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.008
          total_steps: 200000
          warmup_learning_rate: 0.0
          warmup_steps: 5000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "pre-trained-models/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/checkpoint/ckpt-0"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  metrics_set: "coco_mask_metrics"
  eval_instance_masks: true
  use_moving_averages: false
  batch_size: 1
  include_metrics_per_category: true
}
eval_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}