How to train and evaluate simultaneously in Object Detection API?


Problem description

I want to train and evaluate ssd_mobilenet_v1_coco on my own dataset at the same time in the Object Detection API.

However, when I simply try to do so, the GPU memory is already nearly full and the evaluation script fails to start. Here are the commands I use for training and then evaluation.

The training script is called in one terminal pane like this:

python3 train.py \
        --logtostderr \
        --train_dir=training_ssd_mobile_caltech \
        --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config

That runs fine and training works... then I try to run the evaluation script in a second terminal pane:

 python3 eval.py \
        --logtostderr \
        --checkpoint_dir=training_ssd_mobile_caltech \
        --eval_dir=eval_caltech \
        --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config 

It fails with the following error:

python3 eval.py \
        --logtostderr \
        --checkpoint_dir=training_ssd_mobile_caltech \
        --eval_dir=eval_caltech \
       --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config 
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 18:40:00.302271: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 18:40:00.412808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-28 18:40:00.413217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 93.00MiB
2018-02-28 18:40:00.413424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-02-28 18:40:00.957090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 43.00M (45088768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:00.957919: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 38.70M (40580096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
2018-02-28 18:40:02.274830: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:02.278599: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.280515: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.281958: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.282082: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.75MiB.  Current allocation summary follows.
2018-02-28 18:40:12.282160: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256):   Total Chunks: 190, Chunks in use: 190. 47.5KiB allocated for chunks. 47.5KiB in use in bin. 11.8KiB client-requested in use in bin.
2018-02-28 18:40:12.282251: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512):   Total Chunks: 70, Chunks in use: 70. 35.0KiB allocated for chunks. 35.0KiB in use in bin. 35.0KiB client-requested in use in bin.
[.......................................]2018-02-28 18:40:12.290959: I tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use chunks: 29.83MiB
2018-02-28 18:40:12.290971: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats: 
Limit:                    45088768
InUse:                    31284736
MaxInUse:                 32368384
NumAllocs:                     808
MaxAllocSize:              5796864

2018-02-28 18:40:12.291022: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************xx*********xx**_*__****______***********************************************xx
2018-02-28 18:40:12.291044: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:root:The following classes have no ground truth examples: 1
/home/mm/models/research/object_detection/utils/metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
  num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:710: RuntimeWarning: Mean of empty slice
  mean_ap = np.nanmean(self.average_precision_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:711: RuntimeWarning: Mean of empty slice
  mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval.py", line 146, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "eval.py", line 142, in main
    FLAGS.checkpoint_dir, FLAGS.eval_dir)
  File "/home/mm/models/research/object_detection/evaluator.py", line 240, in evaluate
    save_graph_dir=(eval_dir if eval_config.save_graph else ''))
  File "/home/mm/models/research/object_detection/eval_util.py", line 407, in repeated_checkpoint_run
    save_graph_dir)
  File "/home/mm/models/research/object_detection/eval_util.py", line 286, in _run_checkpoint_once
    result_dict = batch_processor(tensor_dict, sess, batch, counters)
  File "/home/mm/models/research/object_detection/evaluator.py", line 183, in _process_batch
    result_dict = sess.run(tensor_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D', defined at:
  File "eval.py", line 146, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "eval.py", line 142, in main
    FLAGS.checkpoint_dir, FLAGS.eval_dir)
  File "/home/mm/models/research/object_detection/evaluator.py", line 161, in evaluate
    ignore_groundtruth=eval_config.ignore_groundtruth)
  File "/home/mm/models/research/object_detection/evaluator.py", line 72, in _extract_prediction_tensors
    prediction_dict = model.predict(preprocessed_image, true_image_shapes)
  File "/home/mm/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 334, in predict
    preprocessed_inputs)
  File "/home/mm/models/research/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 112, in extract_features
    scope=scope)
  File "/home/mm/models/research/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
    scope=end_point)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 762, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 652, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 190, in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Before eval.py even starts, the TF training process has already allocated essentially all of the GPU memory in advance, so I can't figure out how to have both of them running at the same time, or at least have the Object Detection API run evaluation at specific intervals.
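Incidentally, the log above already shows freeMemory: 93.00MiB on an 8 GiB card the moment eval.py starts. One way to confirm that it is the training process holding that memory is to query the GPU while train.py runs, for example with nvidia-smi (assuming the NVIDIA driver's command-line tool is available); this is just a diagnostic aside, not part of the fix:

nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
nvidia-smi

The process table at the bottom of the plain nvidia-smi output shows which PID holds the memory; the train.py process should account for nearly all of it.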

So, in the first place, is it possible to have evaluation run simultaneously with training? If so, how is it done?

System information

What is the top-level directory of the model you are using: object_detection

Have I written custom code: not yet...

OS Platform and Distribution: Linux Ubuntu 16.04 LTS

TensorFlow installed from (source or binary): pip3 tensorflow-gpu

TensorFlow version (use command below): 1.5.0

CUDA/cuDNN version: 9.0/7.0

GPU model and memory: GTX 1080, 8 GB

Recommended answer

To force the eval job to run on your CPU (and prevent it from taking precious GPU memory), create one virtualenv where you install tensorflow-gpu, which you use for training (named e.g. virtual_tf_gpu), and another one where you install tensorflow WITHOUT GPU support (e.g. virtual_tf). Activate the two virtualenvs in two separate terminal windows, then start training in the GPU-supported environment and evaluation in the CPU-supported environment.
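A minimal sketch of that setup, assuming the standard python3 venv module stands in for virtualenv and reusing the directories and config path from the question (each environment also needs the Object Detection API's own dependencies installed):

# GPU environment, used for training
python3 -m venv virtual_tf_gpu
. virtual_tf_gpu/bin/activate
pip install tensorflow-gpu==1.5.0
python3 train.py \
        --logtostderr \
        --train_dir=training_ssd_mobile_caltech \
        --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config

# In a second terminal: CPU-only environment, used for evaluation
python3 -m venv virtual_tf
. virtual_tf/bin/activate
pip install tensorflow==1.5.0
python3 eval.py \
        --logtostderr \
        --checkpoint_dir=training_ssd_mobile_caltech \
        --eval_dir=eval_caltech \
        --pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config

Because the CPU-only build never touches the GPU, the eval process no longer competes with training for the 8 GB of GPU memory. As a hedged aside, hiding the GPU from the eval process by setting CUDA_VISIBLE_DEVICES to an empty value (a standard CUDA environment variable, not part of the answer above) should achieve a similar effect without a second environment.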

Good luck!!!

