Tensorflow Object Detection Training Killed, Resource starvation?

Question

This question has partially been asked here and here, with no follow-ups, so maybe this isn't the right venue, but I've figured out a little more information that I'm hoping might get these questions answered.

I've been attempting to train object_detection on my own library of roughly 1k photos, using the provided pipeline config file "ssd_inception_v2_pets.config". I believe I've set up the training data properly. The program appears to start training just fine; when it couldn't read the data, it alerted with an error, and I fixed that.

My train_config settings are as follows, though I've changed a few of the numbers to try to get it to run with fewer resources.

train_config: {
  batch_size: 1000 # also tried 1, 10, and 100
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.04  # also tried .004
          decay_steps: 800 # also tried 800720, 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt" #using inception checkpoint
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

Basically, what I think is happening is that the computer is getting resource-starved very quickly, and I'm wondering if anyone has an optimization that takes more time but uses fewer resources?

OR am I wrong about why the process is getting killed, and is there a way for me to get more information about that from the kernel?

This is the dmesg information I get after the process is killed.

[711708.975215] Out of memory: Kill process 22087 (python) score 517 or sacrifice child
[711708.975221] Killed process 22087 (python) total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB
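
The OOM killer's fuller report, including a table of every candidate process and its memory use, also lands in the kernel log; assuming a typical Linux machine, commands along these lines pull that context back out (the second form needs systemd):

# show the OOM report with readable timestamps and surrounding context
dmesg -T | grep -i -B 5 -A 20 "out of memory"
# or read the kernel messages from the systemd journal instead
journalctl -k | grep -i -A 20 "out of memory"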

Answer

I ran into the same problem as you did. The memory exhaustion is actually caused by the data_augmentation_options ssd_random_crop, so you can remove that option and set the batch size to 8 or smaller, e.g. 2 or 4. When I set the batch size to 1, I also ran into some problems caused by NaN loss.

Another thing is that the parameter epsilon should be a very small number, such as 1e-6, according to the "Deep Learning" book. epsilon is there to avoid a zero denominator, so the default value of 1.0 used here doesn't look right.
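
Putting those two suggestions together, the train_config from the question might look roughly like this (just a sketch: the checkpoint path is copied from the question, and batch_size 4 is one of the "8 or smaller" values, not a tested setting):

train_config: {
  batch_size: 4  # 8 or smaller; very large batches exhaust host memory
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1e-6  # small constant guarding against a zero denominator
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # ssd_random_crop removed: it is what drives the memory blow-up
}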
