Tensorflow Object Detection Training Killed, Resource starvation?

Question

This question has partially been asked here and here, with no follow-ups, so maybe this isn't the right venue, but I've figured out a little more information that I'm hoping might get these questions answered.

I've been attempting to train object_detection on my own library of roughly 1k photos, using the provided pipeline config file "ssd_inception_v2_pets.config". I believe I've set up the training data properly. The program appears to start training just fine; when it couldn't read the data, it alerted with an error, and I fixed that.

My train_config settings are as follows, though I've changed a few of the numbers to try to get it to run with fewer resources.

train_config: {
  batch_size: 1000 # also tried 1, 10, and 100
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.04  # also tried .004
          decay_steps: 800 # also tried 800720, 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt" #using inception checkpoint
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

Basically, what I think is happening is that the computer is getting resource-starved very quickly, and I'm wondering if anyone has an optimization that takes more time but uses fewer resources?

OR am I wrong about why the process is getting killed, and is there a way for me to get more information about that from the kernel?

This is the dmesg information I get after the process is killed.

[711708.975215] Out of memory: Kill process 22087 (python) score 517 or sacrifice child
[711708.975221] Killed process 22087 (python) total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB
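
The OOM killer's fuller report, including a table of every candidate process and its memory use, also lands in the kernel log; assuming a typical Linux machine, commands along these lines pull that context back out (the second form needs systemd):

# show the OOM report with readable timestamps and surrounding context
dmesg -T | grep -i -B 5 -A 20 "out of memory"
# or read the kernel messages from the systemd journal instead
journalctl -k | grep -i -A 20 "out of memory"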

Answer

I ran into the same problem as you did. The memory exhaustion is actually caused by the data_augmentation_options ssd_random_crop, so you can remove that option and set the batch size to 8 or smaller, e.g. 2 or 4. When I set the batch size to 1, I also ran into some problems caused by NaN loss.

Another thing is that the parameter epsilon should be a very small number, such as 1e-6, according to the "Deep Learning" book. epsilon is there to avoid a zero denominator, so the default value of 1.0 used here doesn't look right.
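
Putting those two suggestions together, the train_config from the question might look roughly like this (just a sketch: the checkpoint path is copied from the question, and batch_size 4 is one of the "8 or smaller" values, not a tested setting):

train_config: {
  batch_size: 4  # 8 or smaller; very large batches exhaust host memory
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1e-6  # small constant guarding against a zero denominator
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # ssd_random_crop removed: it is what drives the memory blow-up
}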
