ERROR: Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt


Question

I am running my detection model on Google Cloud ML and get this error when running the evaluation script. I found this link, which mentions the issue, but it seems the issue has still not been resolved. Does anyone know how to fix this? Any help would be appreciated. Thanks.

ERROR 2018-02-04 12:53:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0
INFO 2018-02-04 12:53:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds
INFO 2018-02-04 12:58:10 -0600 master-replica-0 Starting evaluation at 2018-02-04-18:58:10
ERROR 2018-02-04 12:58:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0
INFO 2018-02-04 12:58:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds

...

Meanwhile, the training job itself appears to be running normally. Its log looks like this:

... (roughly 14 hours into the run)

INFO 2018-02-04 05:09:05 -0600 worker-replica-3 global step 185874: loss = 0.7012 (0.764 sec/step)
INFO 2018-02-04 05:09:05 -0600 worker-replica-4 global step 185873: loss = 0.7749 (0.797 sec/step)
INFO 2018-02-04 05:09:05 -0600 worker-replica-2 global step 185875: loss = 0.4939 (0.775 sec/step)
INFO 2018-02-04 05:09:05 -0600 master-replica-0 global step 185877: loss = 1.1430 (0.850 sec/step)
INFO 2018-02-04 05:09:05 -0600 worker-replica-1 global step 185878: loss = 0.8231 (0.777 sec/step)
INFO 2018-02-04 05:09:05 -0600 worker-replica-0 global step 185881: loss = 0.6470 (0.779 sec/step)

Recommended answer

A few things to check:

  1. Is the training code actually set up to export checkpoints? If you're using an Estimator, this generally works, assuming you run it in the standard way (e.g., in TF >= 1.4, tf.estimator.train_and_evaluate).
  2. Are you passing the correct output directory to the code that saves checkpoints? For instance, could the training code be writing checkpoints to a local (temporary?) directory instead of GCS? Could it be saving them to a different directory on GCS? A quick scan of the code plus a few well-placed print/logging statements is useful here.
  3. How frequently does the training code export checkpoints? For example, if it saves only every 10 minutes, then, with the evaluator retrying every 300 seconds as in your log, you would expect about 1-2 "No model found" messages for every successful evaluation.
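
For point 2, one way to check is to list the training directory (for example with gsutil ls gs://obj-detection/train) and see whether the files for the checkpoint named in the error are actually there. The sketch below is only an illustration, under the assumption that the checkpoints use TensorFlow's V2 on-disk layout, where a prefix such as model.ckpt-N is stored as model.ckpt-N.index plus one or more model.ckpt-N.data-XXXXX-of-XXXXX shards. The helper names and the listing in the usage note are hypothetical, not taken from the question.

```python
import re

# Assumed TF V2 checkpoint layout: "<prefix>.index" plus
# "<prefix>.data-00000-of-00001"-style shards.
_INDEX_SUFFIX = ".index"
_DATA_SHARD = re.compile(r"\.data-\d{5}-of-\d{5}$")


def checkpoint_files(prefix, object_names):
    """Return the names in `object_names` that belong to checkpoint `prefix`.

    `object_names` is any iterable of full object paths, e.g. the lines
    printed by a directory listing (hypothetical input for this sketch).
    """
    matches = []
    for name in object_names:
        if not name.startswith(prefix):
            continue
        suffix = name[len(prefix):]
        # Requiring an exact suffix keeps "model.ckpt-10" from matching
        # files that really belong to "model.ckpt-100".
        if suffix == _INDEX_SUFFIX or _DATA_SHARD.match(suffix):
            matches.append(name)
    return matches


def is_complete(prefix, object_names):
    """True if an index file and at least one data shard exist for `prefix`."""
    found = checkpoint_files(prefix, object_names)
    has_index = any(n.endswith(_INDEX_SUFFIX) for n in found)
    has_data = any(_DATA_SHARD.search(n) for n in found)
    return has_index and has_data
```

As a usage example with a made-up listing: if the directory held only example-bucket/train/checkpoint, example-bucket/train/model.ckpt-100.index and example-bucket/train/model.ckpt-100.data-00000-of-00001, then is_complete would return True for the prefix ending in model.ckpt-100 and False for the one ending in model.ckpt-0. A False result for the prefix named in the error would suggest that the checkpoint is being written somewhere else, or that the "checkpoint" state file in that directory points at a prefix whose files are missing.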
