Tensorflow object detection train.py fails when running on cloud machine learning engine


Problem description


I have a small working example of the TensorFlow Object Detection API running locally. Everything looks great. My goal is to use their scripts to run in Google Cloud Machine Learning Engine, which I've used extensively in the past. I am following these docs.

Declare some relevant variables:

declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"

My YAML looks like:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

The relevant paths are set in the config, e.g.:

  fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"

I've packaged object detection and slim using setup.py.

Running

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path= ${PIPELINE_CONFIG_PATH}

yields a TensorFlow (import?) error. It's a bit cryptic:

insertId:  "1inuq6gg27fxnkc"  
 logName:  "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"  
 receiveTimestamp:  "2017-10-17T21:38:34.435293164Z"  
 resource: {…}  
 severity:  "ERROR"  
 textPayload:  "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
    model_config, train_config, input_config = get_configs_from_multiple_files()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
    text_format.Merge(f.read(), train_config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
    return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .

I've seen this error in other questions related to prediction on Machine Learning Engine, suggesting the error probably(?) is not directly related to the object detection code. But it feels like it's not being packaged correctly; missing dependencies, perhaps? I've updated my gcloud to the latest version.

Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud 
gsutil 4.27

It's hard to see how it's related to this problem here:

FailedPreconditionError when running TF Object Detection API with own model

Why would the code need to be initialized differently in the cloud?

Update #1.

The curious thing is that eval.py works fine, so it can't be the path to the config file, or anything that train.py and eval.py share. eval.py patiently sits and waits for model checkpoints to be created.

Another idea might be that the checkpoint has somehow been corrupted during upload. We can test this by bypassing the checkpoint and training from scratch.

In .config

  from_detection_checkpoint: false

that yields the same precondition error, so it can't be the model.

Solution

The root cause is a slight typo:

--pipeline_config_path= ${PIPELINE_CONFIG_PATH}

has an extra space. Try this:

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH}
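Why does that single space matter? Everything after the `--` separator is forwarded verbatim to train.py, but the shell has already split `--pipeline_config_path= ${PIPELINE_CONFIG_PATH}` into two separate words. A minimal sketch of what train.py ends up seeing (the bucket path below is just a placeholder):

```shell
# The shell splits "--pipeline_config_path= gs://..." into two words,
# so the flag receives an empty value and the real path is orphaned
# as a stray positional argument.
set -- --pipeline_config_path= gs://bucket/pipeline.config
echo "argc=$#"    # prints argc=2: two separate arguments
echo "arg1=$1"    # prints arg1=--pipeline_config_path= (empty value)
echo "arg2=$2"    # prints arg2=gs://bucket/pipeline.config
```

train.py then tries to read an empty config path, which surfaces as the bare `FailedPreconditionError: .` in the traceback above.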
