Re-training inception google cloud stuck at global step 0

Problem description

I am following the flowers tutorial for re-training Inception on Google Cloud ML. I can run the tutorial, train, and predict just fine.

I then substituted a test dataset of my own for the flowers dataset: optical character recognition of image digits.

My full code is here

Labels

Eval setup

Training setup

Running from a recent docker build provided by Google:

docker run -it -p "127.0.0.1:8080:8080" --entrypoint=/bin/bash gcr.io/cloud-datalab/datalab:local-20161227

I can preprocess files and submit the training job using:

# Submit training job.
gcloud beta ml jobs submit training "$JOB_ID" \
  --module-name trainer.task \
  --package-path trainer \
  --staging-bucket "$BUCKET" \
  --region us-central1 \
  -- \
  --output_path "${GCS_PATH}/training" \
  --eval_data_paths "${GCS_PATH}/preproc/eval*" \
  --train_data_paths "${GCS_PATH}/preproc/train*"

but it never makes it past global step 0. The flowers tutorial trained in about an hour on the free tier; I have let my own training run for as long as 11 hours with no movement.

Looking at Stackdriver, I see no progress.

I have also tried a tiny toy dataset of 20 training images, and 10 eval images. Same issue.

The GCS Bucket ends up looking like this

Perhaps unsurprisingly, I can't visualize this run in TensorBoard either; there is nothing to show.

Full training log:

INFO    2017-01-10 17:22:00 +0000       unknown_task            Validating job requirements...
INFO    2017-01-10 17:22:01 +0000       unknown_task            Job creation request has been successfully validated.
INFO    2017-01-10 17:22:01 +0000       unknown_task            Job MeerkatReader_MeerkatReader_20170110_170701 is queued.
INFO    2017-01-10 17:22:07 +0000       unknown_task            Waiting for job to be provisioned.
INFO    2017-01-10 17:22:07 +0000       unknown_task            Waiting for TensorFlow to start.
INFO    2017-01-10 17:22:10 +0000       master-replica-0                Running task with arguments: --cluster={"master": ["master-d4f6-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO    2017-01-10 17:22:10 +0000       master-replica-0                  "package_uris": ["gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz"],
INFO    2017-01-10 17:22:10 +0000       master-replica-0                  "python_module": "trainer.task",
INFO    2017-01-10 17:22:10 +0000       master-replica-0                  "args": ["--output_path", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training", "--eval_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval*", "--train_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train*"],
INFO    2017-01-10 17:22:10 +0000       master-replica-0                  "region": "us-central1"
INFO    2017-01-10 17:22:10 +0000       master-replica-0                } --beta
INFO    2017-01-10 17:22:10 +0000       master-replica-0                Downloading the package: gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz
INFO    2017-01-10 17:22:10 +0000       master-replica-0                Running command: gsutil -q cp gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz trainer-0.1.tar.gz
INFO    2017-01-10 17:22:12 +0000       master-replica-0                Building wheels for collected packages: trainer
INFO    2017-01-10 17:22:12 +0000       master-replica-0                creating '/tmp/tmpSgdSzOpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer/model.py'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer/util.py'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer/preprocess.py'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer/task.py'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer-0.1.dist-info/metadata.json'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer-0.1.dist-info/WHEEL'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                adding 'trainer-0.1.dist-info/METADATA'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                  Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO    2017-01-10 17:22:12 +0000       master-replica-0                  Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO    2017-01-10 17:22:12 +0000       master-replica-0                Successfully built trainer
INFO    2017-01-10 17:22:13 +0000       master-replica-0                Running command: python -m trainer.task --output_path gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training --eval_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval* --train_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train*
INFO    2017-01-10 17:22:14 +0000       master-replica-0                Starting master/0
INFO    2017-01-10 17:22:14 +0000       master-replica-0                Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
INFO    2017-01-10 17:22:14 +0000       master-replica-0                Started server with target: grpc://localhost:2222
ERROR   2017-01-10 17:22:16 +0000       master-replica-0                device_filters: "/job:ps"
INFO    2017-01-10 17:22:19 +0000       master-replica-0                global_step/sec: 0

It just repeats that last line until I kill the job.

Is my mental model of this service incorrect? All suggestions welcome.

Accepted answer

Everything looks fine. My suspicion is that there is a problem with your data. Specifically, I suspect TF is unable to read any data from your GCS files (are they empty?). As a result, when you invoke train, TF ends up blocking while trying to read a batch of data it can never get.

I would suggest adding logging statements around the call to session.run in Trainer.run_training. This will tell you whether that is the line where it is getting stuck.
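A minimal sketch of that instrumentation, assuming the training loop calls `session.run` roughly as in the flowers sample (the `logged_run` helper and fetch names here are illustrative, not part of the sample code):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def logged_run(session, fetches):
    """Wrap session.run with before/after log lines so a hang becomes visible.

    If the 'starting' line shows up in Stackdriver but 'finished' never does,
    the job is blocked inside session.run, most likely waiting on the input
    readers rather than doing any training.
    """
    start = time.time()
    logging.info("session.run starting")
    results = session.run(fetches)
    logging.info("session.run finished in %.1fs", time.time() - start)
    return results

# In Trainer.run_training, a call like
#     _, global_step = session.run([self.train_op, self.global_step])
# would become
#     _, global_step = logged_run(session, [self.train_op, self.global_step])
```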

I'd also suggest checking the sizes of your GCS files.
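One way to run that check is to pass the output of `gsutil ls -l "${GCS_PATH}/preproc/*"` through a small script that flags zero-byte shards. This is a sketch: it parses gsutil's human-readable long listing, whose exact format can vary between gsutil versions.

```python
def find_empty_objects(ls_output):
    """Given the output of `gsutil ls -l <pattern>`, return zero-byte paths.

    Each data line looks like '<size>  <timestamp>  gs://bucket/object';
    the trailing 'TOTAL: ...' summary line is skipped.
    """
    empty = []
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit() and parts[2].startswith("gs://"):
            if int(parts[0]) == 0:
                empty.append(parts[2])
    return empty
```

If this returns any of your `train*` or `eval*` shards, the preprocessing step produced empty files and the reader queue has nothing to dequeue.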

TensorFlow also has an experimental RunOptions which allows you to specify a timeout for Session.run. Once this feature is ready, it could be useful for ensuring your code doesn't block forever.
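In the meantime, a generic stand-in is easy to write: run the blocking call on a worker thread and give up after a deadline. This is a sketch of that workaround, not the TF RunOptions API itself; unlike a real timeout it cannot cancel the underlying call, but it lets the caller notice the hang and abort.

```python
import threading

def run_with_timeout(fn, timeout_sec):
    """Run fn on a daemon thread; return (finished, result).

    A crude stand-in for tf.RunOptions(timeout_in_ms=...): if fn (for
    example, a session.run call) blocks past the deadline, 'finished'
    comes back False instead of the process hanging forever.
    """
    box = {}
    worker = threading.Thread(target=lambda: box.setdefault("result", fn()))
    worker.daemon = True
    worker.start()
    worker.join(timeout_sec)
    return (not worker.is_alive(), box.get("result"))
```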
