How to Pause / Resume Training in TensorFlow


Question

This question was asked before the documentation for save and restore was available. I now consider this question deprecated and suggest that people rely on the official documentation on Save and Restore.

Gist of the old question:

I got TF working fine for the CIFAR tutorial. I've changed the code to save the train_dir (the directory with checkpoints and models) to a known location.

Which brings me to my question: how can I pause and resume training with TF?

Recommended answer

TensorFlow uses graph-based computation: nodes are ops, edges carry tensors between them, and Variables hold the state. It provides a Saver for its Variables.

Because the computation can be distributed, you can run part of a graph on one machine/processor and the rest on another. You can also save the state (the Variables) and restore it next time to continue your work.

saver.save(sess, 'my-model', global_step=0) ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000) ==> filename: 'my-model-1000'

Later you can use

tf.train.Saver.restore(sess, save_path)

to restore your saved variables.
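Putting save and restore together, a pause/resume loop might look like the sketch below. It is a minimal illustration, not the CIFAR tutorial's code: it assumes a TensorFlow 2.x install that still exposes the graph-mode API under tf.compat.v1, and the train helper, the toy variable w, and the save_every parameter are hypothetical names made up for this example. A real model would replace the two assign_add ops with its optimizer step.

```python
import os
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # assumption: TF 2.x with the v1 graph-mode API available


def train(ckpt_dir, num_steps, save_every=5):
    """Run num_steps more steps, resuming from ckpt_dir if a checkpoint exists.

    Hypothetical helper for illustration only.
    """
    graph = tf.Graph()
    with graph.as_default():
        # Toy state: one weight plus the global step counter.
        w = tf1.get_variable("w", initializer=0.0)
        global_step = tf1.train.get_or_create_global_step()
        # Stand-ins for a real optimizer update.
        inc_w = tf1.assign_add(w, 1.0)
        inc_step = tf1.assign_add(global_step, 1)
        saver = tf1.train.Saver()
        init_op = tf1.global_variables_initializer()

    prefix = os.path.join(ckpt_dir, "my-model")
    with tf1.Session(graph=graph) as sess:
        latest = tf.train.latest_checkpoint(ckpt_dir)
        if latest:
            saver.restore(sess, latest)  # resume: load the saved Variables
        else:
            sess.run(init_op)  # fresh start: initialize the Variables
        for _ in range(num_steps):
            _, step = sess.run([inc_w, inc_step])
            if step % save_every == 0:
                saver.save(sess, prefix, global_step=int(step))
        # "Pause": write a final checkpoint before leaving the session.
        final_w, final_step = sess.run([w, global_step])
        saver.save(sess, prefix, global_step=int(final_step))
        return final_w, final_step
```

Calling train(some_dir, 1000) and later train(some_dir, 1000) again with the same directory would continue from step 1000 rather than from 0, because the global step is itself a saved Variable.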

Saver usage
