How to Pause / Resume Training in TensorFlow
Question
This question was asked before the documentation for Save and Restore was available. I now consider it deprecated and advise people to rely on the official documentation on Save and Restore.
Gist of the old question:
I got TF working fine for the CIFAR tutorial. I've changed the code to save the train_dir (the directory with checkpoints and models) to a known location.
Which brings me to my question: how can I pause and resume training with TF?
Recommended answer
TensorFlow represents computation as a graph: the nodes are ops, the edges carry tensors between them, and Variables (Vars) hold the state that persists across steps. It provides a Saver for its Vars.
Because the computation is a graph that can be distributed, you can run part of it on one machine/processor and the rest on another. Meanwhile, you can save the state (Vars) and feed it back in next time to continue your work.
saver.save(sess, 'my-model', global_step=0)     # ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000)  # ==> filename: 'my-model-1000'
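Later you can use
tf.train.Saver.restore(sess, save_path)
to restore your saved Vars. It is called on your Saver instance, and save_path is a filename that save produced, such as 'my-model-1000'.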
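Putting save and restore together, a pause/resume loop might look like the minimal sketch below. It assumes TensorFlow 2.x is installed and uses the TF1-style graph API through `tensorflow.compat.v1`. The `train` function, the toy `weight` variable, and the toy training op are hypothetical stand-ins for a real model and are not part of the original answer.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

# Saver and Session are graph-mode (TF1-style) APIs, so turn eager mode off.
tf.disable_eager_execution()


def train(train_dir, num_steps):
    """Run num_steps more steps, resuming from train_dir if a checkpoint exists."""
    os.makedirs(train_dir, exist_ok=True)
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()
        # Hypothetical stand-in for model weights; a real model has many Vars.
        weight = tf.get_variable(
            "weight", shape=[], initializer=tf.zeros_initializer())
        # Toy "training" op: bump the weight and the step counter together.
        train_op = tf.group(
            tf.assign_add(weight, 0.5), tf.assign_add(global_step, 1))
        saver = tf.train.Saver()

        with tf.Session() as sess:
            latest = tf.train.latest_checkpoint(train_dir)
            if latest is not None:
                # Resume: restore every saved Var, including global_step.
                saver.restore(sess, latest)
            else:
                # Fresh start: nothing to restore, so initialize instead.
                sess.run(tf.global_variables_initializer())

            for _ in range(num_steps):
                sess.run(train_op)

            # "Pause": write a checkpoint whose name ends in the current step.
            saver.save(sess, os.path.join(train_dir, "my-model"),
                       global_step=global_step)
            return sess.run([global_step, weight])


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(train(demo_dir, 3))  # first run starts from scratch
    print(train(demo_dir, 2))  # second run picks up where the first stopped
```

Because the step counter is itself a Var, it is saved and restored along with the weights, so the second call continues counting from where the first one stopped instead of starting again at zero.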