Multiple sessions and graphs in Tensorflow (in the same process)

Question

I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()) in the same process.

There is a substantial amount of code and abstraction, so I'm just pasting the relevant sections here.

Here is the restore code:

# Keep only this model's variables (selected by name prefix) so the Saver
# does not touch variables that belong to the other model.
self.variables = [var for var in all_vars if var.name.startswith(self.name)]
self.saver = tf.train.Saver(self.variables, max_to_keep=3)
self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))

if should_restore:
    self.saver.restore(self.sess, self.save_path)
else:
    self.sess.run(tf.initialize_variables(self.variables))

Each model is scoped within its own graph and session, like this:

self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)

with self.sess.graph.as_default():
    # Create variables and ops.

All the variables within each model are created within the variable_scope context manager.

The feeding works like this:

  • A background thread calls sess.run(inference_op) on input = scipy.misc.imread(X) and puts the result in a blocking thread-safe queue.
  • The main training loop reads from the queue and calls sess.run(train_op) on the second model (see the sketch after this list).
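
For illustration, here is a minimal sketch of this producer/consumer feeding pattern, assuming hypothetical inference_model and train_model objects that each hold their own graph and session as above; the object names, the input_ph placeholders and the image_paths list are illustrative, not from the original code:

import threading
import queue

import scipy.misc

# Bounded queue so the producer blocks instead of running far ahead of training.
feature_queue = queue.Queue(maxsize=8)

def producer(image_paths):
    # Background thread: run the first (restored) model and enqueue its outputs.
    for path in image_paths:
        image = scipy.misc.imread(path)
        features = inference_model.sess.run(
            inference_model.inference_op,
            feed_dict={inference_model.input_ph: image})
        feature_queue.put(features)  # blocks while the queue is full

def train(num_steps):
    # Main training loop: dequeue a feature vector and run one training step.
    for _ in range(num_steps):
        features = feature_queue.get()
        train_model.sess.run(
            train_model.train_op,
            feed_dict={train_model.input_ph: features})

threading.Thread(target=producer, args=(image_paths,), daemon=True).start()
train(num_steps=10000)

Note that in this arrangement the two sessions are executed concurrently from different threads, which is the behaviour examined in the answer below.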

PROBLEM:
I am observing that the loss values, even in the very first iteration of training (of the second model), keep changing drastically across runs (and become nan within a few iterations). I confirmed that the output of the first model is exactly the same every time. Commenting out the sess.run of the first model and replacing it with identical input read from a pickled file does not show this behaviour.

Here is the train_op:

    # Per-example cross entropy; `labels` is assumed to hold the integer class ids.
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=network.feedforward())
    loss_op = tf.reduce_mean(xent)
    # Apply gradients.
    with tf.control_dependencies([loss_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(loss_op)
        apply_gradient_op = opt.apply_gradients(grads)

    return apply_gradient_op

I know this is vague, but I'm happy to provide more details. Any help is appreciated!

Answer

The issue is almost certainly caused by concurrent execution of different session objects. I moved the first model's session from the background thread to the main thread, repeated the controlled experiment several times (running for over 24 hours and reaching convergence), and never observed NaN. On the other hand, concurrent execution makes the model diverge within a few minutes.

I've restructured my code to use a common session object for all models.
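
A minimal sketch of that restructuring, under the assumption that both models are built in a single graph (a tf.Session is bound to exactly one tf.Graph, so sharing a session implies sharing a graph) with separate variable scopes; the scope names, checkpoint path and build_* helpers below are illustrative, not from the original code:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Build both models in the same graph, under separate variable scopes.
    with tf.variable_scope("model1"):
        inference_op = build_inference_model()          # assumed helper
    with tf.variable_scope("model2"):
        train_op = build_training_model(inference_op)   # assumed helper

    # Partition the variables by scope, as in the restore code above.
    model1_vars = [v for v in tf.all_variables() if v.name.startswith("model1")]
    model2_vars = [v for v in tf.all_variables() if v.name.startswith("model2")]
    saver = tf.train.Saver(model1_vars, max_to_keep=3)
    init_model2 = tf.initialize_variables(model2_vars)

# A single session shared by both models; every run call now goes through it.
sess = tf.Session(graph=graph)
saver.restore(sess, tf.train.latest_checkpoint("/path/to/model1/checkpoints"))
sess.run(init_model2)

Combined with issuing both the inference and the training run calls from the main thread, this avoids running separate session objects concurrently.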
