How do Monitored Training Sessions work?

Question

I'm trying to understand the difference between using tf.Session and tf.train.MonitoredTrainingSession, and where I might prefer one over the other. It seems that when I use the latter, I can avoid many "chores" such as initializing variables, starting queue runners, or setting up file writers for summary operations. On the other hand, with a monitored training session, I cannot specify the computation graph I want to use explicitly. All of this seems rather mysterious to me. Is there some underlying philosophy behind how these classes were created that I'm not understanding?

Solution

I can't offer much insight into how these classes were created, but here are a few things that I think are relevant to how you could use them.

tf.Session is a low-level object in the Python TensorFlow API, while, as you said, tf.train.MonitoredTrainingSession comes with a lot of handy features that are especially useful in the most common cases.

Before describing some of the benefits of tf.train.MonitoredTrainingSession, let me answer the question about the graph used by the session. You can specify the tf.Graph used by the MonitoredTrainingSession by using a context manager with your_graph.as_default():

from __future__ import print_function
import tensorflow as tf

def example():
    g1 = tf.Graph()
    with g1.as_default():
        # Define operations and tensors in `g1`.
        c1 = tf.constant(42)
        assert c1.graph is g1

    g2 = tf.Graph()
    with g2.as_default():
        # Define operations and tensors in `g2`.
        c2 = tf.constant(3.14)
        assert c2.graph is g2

    # MonitoredTrainingSession example
    with g1.as_default():
        with tf.train.MonitoredTrainingSession() as sess:
            print(c1.eval(session=sess))
            # Next line raises
            # ValueError: Cannot use the given session to evaluate tensor:
            # the tensor's graph is different from the session's graph.
            try:
                print(c2.eval(session=sess))
            except ValueError as e:
                print(e)

    # Session example
    with tf.Session(graph=g2) as sess:
        print(c2.eval(session=sess))
        # Next line raises
        # ValueError: Cannot use the given session to evaluate tensor:
        # the tensor's graph is different from the session's graph.
        try:
            print(c1.eval(session=sess))
        except ValueError as e:
            print(e)

if __name__ == '__main__':
    example()

So, as you said, the benefit of using MonitoredTrainingSession is that this object takes care of

  • initialising variables,
  • starting queue runners, and
  • setting up the file writers,

but it also has the benefit of making your code easy to distribute, since it behaves differently depending on whether you specified the running process as chief or not.

For example you could run something like:

def run_my_model(train_op, session_args):
    with tf.train.MonitoredTrainingSession(**session_args) as sess:
        sess.run(train_op)

that you would call in a non-distributed way:

run_my_model(train_op, {})

or in a distributed way (see the distributed doc for more information on the inputs):

run_my_model(train_op, {"master": server.target,
                        "is_chief": (FLAGS.task_index == 0)})

On the other hand, the benefit of using the raw tf.Session object is that you don't carry the extra machinery of tf.train.MonitoredTrainingSession, which can be useful if you don't plan to use those features or if you want more control (for example over how the queues are started).

EDIT (as per comment): For the op initialisation, you would have to do something like the following (cf. the official doc):

# Define your graph and your ops
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    sess.run(your_graph_ops,...)

For the QueueRunner, I would refer you to the official doc where you will find more complete examples.
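
Just to give a rough idea of what the monitored session handles for you there, here is a sketch of how queues are typically started by hand with a raw tf.Session; train_op and an input pipeline that registers QueueRunners (e.g. built with tf.train.batch) are assumed to exist:

import tensorflow as tf

# Sketch only: manual queue handling with a raw tf.Session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(train_op)
    except tf.errors.OutOfRangeError:
        pass  # the input pipeline ran out of data
    finally:
        coord.request_stop()
        coord.join(threads)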

EDIT 2:

The main concept to understand, in order to get a sense of how tf.train.MonitoredTrainingSession works, is the _WrappedSession class:

This wrapper is used as a base class for various session wrappers that provide additional functionality such as monitoring, coordination, and recovery.
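
The class names below are made up; this is only a sketch of the wrapping pattern that quote describes (each wrapper holds an inner session and decorates its run method), not TensorFlow's actual implementation:

import tensorflow as tf

# Sketch only: each wrapper holds an inner session and overrides `run`.
class WrappedSessionSketch(object):
    def __init__(self, sess):
        self._sess = sess  # the inner (possibly already wrapped) session

    def run(self, fetches, **kwargs):
        return self._sess.run(fetches, **kwargs)

# Roughly what _RecoverableSession adds on top of the base wrapper.
class RecoverableSessionSketch(WrappedSessionSketch):
    def run(self, fetches, **kwargs):
        while True:
            try:
                return self._sess.run(fetches, **kwargs)
            except tf.errors.AbortedError:
                continue  # retry (the real class recreates the session)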

The tf.train.MonitoredTrainingSession works (as of version 1.1) this way:

  • It first checks if it is a chief or a worker (cf. the distributed doc for the terminology).
  • It begins the hooks which have been provided (for example, a StopAtStepHook would just retrieve the global_step tensor at this stage).
  • It creates a session which is a Chief (or Worker) session wrapped into a _HookedSession, wrapped into a _CoordinatedSession, wrapped into a _RecoverableSession.
    The Chief/Worker sessions are in charge of running the initialisation ops provided by the Scaffold.

      scaffold: A `Scaffold` used for gathering or building supportive ops. If
    not specified a default one is created. It's used to finalize the graph.
    

  • The chief session also takes care of all the checkpoint parts: e.g. restoring from checkpoints using the Saver from the Scaffold.
  • The _HookedSession is basically there to decorate the run method: it calls the _call_hook_before_run and after_run methods when relevant.
  • At creation, the _CoordinatedSession builds a Coordinator which starts the queue runners and is responsible for closing them.
  • The _RecoverableSession ensures that the run call is retried in case of a tf.errors.AbortedError.

In conclusion, tf.train.MonitoredTrainingSession avoids a lot of boilerplate code while remaining easily extendable through the hooks mechanism.
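
As an illustration of that hooks mechanism, here is a hypothetical custom hook; this is only a sketch, and it assumes a global step tensor has been created (e.g. with tf.train.create_global_step()) and that train_op is defined:

import tensorflow as tf

class LogEveryNStepsHook(tf.train.SessionRunHook):
    """Hypothetical hook that prints the global step every `every_n` runs."""

    def __init__(self, every_n=100):
        self._every_n = every_n
        self._count = 0

    def begin(self):
        # Retrieve the global_step tensor, much like StopAtStepHook does.
        self._global_step = tf.train.get_global_step()

    def before_run(self, run_context):
        # Ask the wrapped session to also fetch the global step.
        return tf.train.SessionRunArgs(self._global_step)

    def after_run(self, run_context, run_values):
        self._count += 1
        if self._count % self._every_n == 0:
            print("global_step:", run_values.results)

# Usage sketch:
# with tf.train.MonitoredTrainingSession(hooks=[LogEveryNStepsHook(10)]) as sess:
#     while not sess.should_stop():
#         sess.run(train_op)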
