TensorFlow Object Detection API killed - OOM. How to reduce shuffle buffer size?

Problem description

  • OS Platform and Distribution: CentOS 7.5.1804
  • TensorFlow installed from: pip install tensorflow-gpu
  • TensorFlow version: tensorflow-gpu 1.8.0
  • CUDA/cuDNN version: 9.0/7.1.2
  • GPU model and memory: GeForce GTX 1080 Ti, 11264MB
  • Exact command to reproduce:

python train.py --logtostderr --train_dir=./models/train --pipeline_config_path=mask_rcnn_inception_v2_coco.config

I am attempting to train a Mask-RCNN model on my own dataset (fine tuning from a model trained on COCO), but the process is killed as soon as the shuffle buffer is filled.

Before this happens, nvidia-smi shows memory usage of around 10669MB/11175MB but only 1% GPU utilisation.

I have tried adjusting the following train_config settings:

batch_size: 1    
batch_queue_capacity: 10    
num_batch_queue_threads: 4    
prefetch_queue_capacity: 5

And for train_input_reader:

num_readers: 1
queue_capacity: 10
min_after_dequeue: 5

I believe my problem is similar to the question "TensorFlow Object Detection API - Out of Memory", but I am using a GPU rather than CPU only.

The images I am training on are comparatively large (2048*2048); however, I would like to avoid downsizing, as the objects to be detected are quite small. My training set consists of 400 images (in a .tfrecord file).

Is there a way to reduce the size of the shuffle buffer to see if this reduces the memory requirement?

INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
2018-06-19 12:21:33.487840: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 97 of 2048
2018-06-19 12:21:43.547326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 231 of 2048
2018-06-19 12:21:53.470634: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 381 of 2048
2018-06-19 12:21:57.030494: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
Killed

Recommended answer

You can try the following steps:

1. Set batch_size=1 (or try your own value).

2. Change the default value of shuffle_buffer_size, for example to optional uint32 shuffle_buffer_size = 11 [default = 256] (or try your own value). The code is here:

models/research/object_detection/protos/input_reader.proto

Line 40 in ce03903

 optional uint32 shuffle_buffer_size = 11 [default = 2048];

The original setting is:

optional uint32 shuffle_buffer_size = 11 [default = 2048]

The default value is 2048, which is too big for batch_size=1 and should be reduced accordingly. In my opinion, it consumes a lot of RAM.
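To get a feel for why the buffer size matters, here is a rough back-of-the-envelope estimate in Python. It assumes the shuffle buffer holds decoded uint8 RGB images of 2048*2048 pixels and ignores masks, boxes, and any other tensors. Whether the buffer actually holds decoded tensors or the smaller serialized records depends on the version of the Object Detection API, so treat the numbers as a worst-case sketch rather than a measurement. The helper names are made up for this example.

```python
# Rough estimate of host RAM held by a full shuffle buffer.
# ASSUMPTION: each buffered element is one decoded uint8 RGB image;
# masks and other per-example tensors are ignored.

GIB = 1024 ** 3


def decoded_image_bytes(height, width, channels=3, bytes_per_value=1):
    """Bytes needed for one decoded image tensor (uint8 by default)."""
    return height * width * channels * bytes_per_value


def shuffle_buffer_bytes(buffer_size, bytes_per_example):
    """Upper bound on the bytes held once the buffer is full."""
    return buffer_size * bytes_per_example


per_image = decoded_image_bytes(2048, 2048)  # 12 MiB per image

# 2048 is the proto default, 400 is the size of the training set in the
# question, 256 is the reduced value suggested in step 2.
for size in (2048, 400, 256):
    total = shuffle_buffer_bytes(size, per_image)
    print("buffer_size=%4d -> %.1f GiB" % (size, total / GIB))
```

Under these assumptions a full buffer of 2048 images would need about 24 GiB of host RAM, against about 3 GiB for 256. A process that ends with a bare "Killed" has typically been stopped by the Linux out-of-memory killer, which acts on host RAM rather than GPU memory.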

3. Recompile the Protobuf libraries.

From tensorflow/models/research/

protoc object_detection/protos/*.proto --python_out=.
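If you prefer not to edit the .proto file by hand, step 2 can be scripted. The sketch below is only an illustration: the function name set_shuffle_buffer_default is made up, and it assumes the declaration has exactly the form quoted above (field number 11, a single occurrence in the file).

```python
import re

# Matches the declaration quoted in step 2. The field name and number are
# taken from input_reader.proto; everything else here is a hypothetical helper.
_DECLARATION = re.compile(
    r"(optional\s+uint32\s+shuffle_buffer_size\s*=\s*11\s*"
    r"\[default\s*=\s*)\d+(\s*\])"
)


def set_shuffle_buffer_default(proto_text, new_default):
    """Return proto_text with the shuffle_buffer_size default replaced."""
    if new_default <= 0:
        raise ValueError("new_default must be positive")
    # A function replacement avoids ambiguity between the group reference
    # and the digits of the new value.
    patched, count = _DECLARATION.subn(
        lambda m: m.group(1) + str(new_default) + m.group(2), proto_text
    )
    if count != 1:
        raise ValueError(
            "expected exactly one shuffle_buffer_size declaration, "
            "found %d" % count
        )
    return patched


original = "  optional uint32 shuffle_buffer_size = 11 [default = 2048];"
print(set_shuffle_buffer_default(original, 256))
```

To apply it, read object_detection/protos/input_reader.proto, pass its contents through the function, write the result back, and then run the protoc command from step 3. Since shuffle_buffer_size is an ordinary optional field of the input reader message, it should also be possible to set shuffle_buffer_size: 256 inside the train_input_reader block of the pipeline config, which would avoid touching the proto at all. Check that against the version of the API you have checked out.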
