Checkpoint file not found, restoring evaluation graph


Problem description

I have a model that runs in distributed mode for 4000 steps. Every 120 seconds the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.

Error:

Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485

The checkpoint file is present at that location, and a local run for 2000 steps works perfectly.

# Look up the most recent checkpoint recorded under the training directory.
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))

I assume that the checkpoint is still being saved and that the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as shown below. This seemed to work at first, but the model still failed with a similar issue.

saver.save(session, sv.save_path, global_step)
time.sleep(2)  # wait for GCS to be updated
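
A fixed sleep is fragile on GCS, where the checkpoint state can become visible before the checkpoint's own files do. Below is a minimal sketch of a more defensive wait, assuming TF 1.x-era APIs (tf.gfile.Exists) and V2 checkpoint files; wait_for_checkpoint and its timeout values are illustrative, not part of the original code:

import time

import tensorflow as tf

def wait_for_checkpoint(checkpoint_dir, max_wait_secs=60, poll_secs=2):
    # Poll until tf.train.latest_checkpoint returns a path whose files
    # are actually visible, instead of sleeping a fixed amount of time.
    deadline = time.time() + max_wait_secs
    while time.time() < deadline:
        ckpt_path = tf.train.latest_checkpoint(checkpoint_dir)
        # A V2 checkpoint is complete once its .index file exists.
        if ckpt_path and tf.gfile.Exists(ckpt_path + ".index"):
            return ckpt_path
        time.sleep(poll_secs)
    return None  # caller decides how to handle a missing checkpoint

The evaluator would then call wait_for_checkpoint(train_dir(FLAGS.output_path)) instead of the bare tf.train.latest_checkpoint call above.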

Recommended answer

From your comment I think I understand what is going on. I may be wrong.

The cloud_ml distributed sample https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426 uses a temporary file by default. As a consequence, it works locally in /tmp. Once training is complete, it copies the result to gs://, but it does not correct the checkpoint state file, which still contains references to local model files in /tmp. Basically, this is a bug.
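
For illustration, the small text file named checkpoint that ends up on gs:// would then look roughly like this (the exact paths are hypothetical), pointing at files that only ever existed on the worker's local disk:

model_checkpoint_path: "/tmp/tmpXXXXXX/model.ckpt-1485"
all_model_checkpoint_paths: "/tmp/tmpXXXXXX/model.ckpt-1485"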

In order to avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.

One way of checking whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then inspect its contents.
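
Here is a minimal sketch of that check in Python, using tf.train.get_checkpoint_state to parse the checkpoint state file; the local directory path is a placeholder for wherever gsutil copied it:

import tensorflow as tf

# Directory holding the `checkpoint` state file fetched with gsutil.
state = tf.train.get_checkpoint_state("/path/to/local/copy")
if state is not None:
    print(state.model_checkpoint_path)  # healthy: a gs://... path
    for p in state.all_model_checkpoint_paths:
        print(p)                        # /tmp paths here confirm the bug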

