Checkpoint file not found, restoring evaluation graph
Question
I have a model which runs in distributed mode for 4000 steps. Every 120 seconds the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still in the process of being saved, and the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as follows. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2)  # wait for GCS to be updated
Answer
From your comment I think I understand what is going on. I may be wrong.
The cloud_ml distributed sample (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426) uses a temporary file by default. As a consequence, it works locally on /tmp. Once the training is complete, it copies the result to gs://, but it does not correct the checkpoint file, which still contains references to the local model files on /tmp. Basically, this is a bug.
In order to avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. Tensorflow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
One way of checking whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil, and then inspect its contents.
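For example, after copying the file down with `gsutil cp gs://path-on-gcs/train/checkpoint .`, you can look at the paths it records. The `checkpoint` file is a small text proto whose lines look like `model_checkpoint_path: "..."`. A minimal sketch (the helper name is my own) that extracts those paths so you can check whether they point at /tmp instead of gs://:

```python
import re


def recorded_checkpoint_paths(checkpoint_file_text):
    """Return every model path recorded in a `checkpoint` file.

    The file contains lines such as:
        model_checkpoint_path: "..."
        all_model_checkpoint_paths: "..."
    A recorded path starting with /tmp rather than gs:// confirms
    the bug described above.
    """
    # `paths?` matches both the singular and plural field names
    return re.findall(r'checkpoint_paths?: "([^"]+)"', checkpoint_file_text)
```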