MobileNet is not usable when is_training is set to false
Question
The more accurate description of this issue is that MobileNet performs badly when is_training is not explicitly set to true. I'm referring to the MobileNet provided by TensorFlow in their model repository: https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.py.
This is how I create the net (phase_train=True):
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=features_layer_size,
        dropout_keep_prob=dropout_keep_prob, is_training=phase_train)
I'm training a recognition network, and while training I test on LFW. The results I get during training improve over time and reach good accuracy.
Before deployment I freeze the graph. If I freeze the graph with is_training=True, the results I get on LFW are the same as during training. But if I set is_training=False, I get results as if the network hadn't been trained at all...
This behavior also happens with other networks, such as Inception.
I tend to believe that I'm missing something very fundamental here and that this is not a bug in TensorFlow...
Any help would be appreciated.
Adding more code...
This is how I prepare for training:
images_placeholder = tf.placeholder(tf.float32, shape=(None, image_size, image_size, 1), name='input')
labels_placeholder = tf.placeholder(tf.int32, shape=(None))
dropout_placeholder = tf.placeholder_with_default(1.0, shape=(), name='dropout_keep_prob')
phase_train_placeholder = tf.Variable(True, name='phase_train')
global_step = tf.Variable(0, name='global_step', trainable=False)
# build graph
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train_placeholder)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=512, dropout_keep_prob=1.0,
        is_training=phase_train_placeholder)
# loss
logits = slim.fully_connected(inputs=features, num_outputs=train_data.get_class_count(),
                              activation_fn=None,
                              weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
                              weights_regularizer=slim.l2_regularizer(scale=0.00005),
                              scope='Logits', reuse=False)
tf.losses.sparse_softmax_cross_entropy(labels=labels_placeholder, logits=logits,
                                       reduction=tf.losses.Reduction.MEAN)
loss = tf.losses.get_total_loss()
# normalize output for inference
embeddings = tf.nn.l2_normalize(features, 1, 1e-10, name='embeddings')
# optimizer
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss, global_step=global_step)
This is my training step:
batch_data, batch_labels = train_data.next_batch()
feed_dict = {
    images_placeholder: batch_data,
    labels_placeholder: batch_labels,
    dropout_placeholder: dropout_keep_prob
}
_, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
I could add the code for how I freeze the graph, but it's not really necessary. It's enough to build the graph with is_training=False, load the latest checkpoint, and run the evaluation on LFW to reproduce the problem.
UPDATE...
I found that the problem is in the batch normalization layer. Setting just this layer to is_training=False is enough to reproduce the problem.
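To see why batch norm alone can cause this, here is a minimal NumPy sketch (not the actual TF implementation, and with the learned gamma/beta omitted): in training mode the layer normalizes with the current batch's statistics, while in inference mode it uses the accumulated moving mean/variance. If those moving statistics were never updated during training, they stay at their initial values (mean 0, variance 1), and inference-mode outputs are far off:

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, eps=1e-5):
    """Simplified batch norm (no learned scale/shift) to illustrate the two modes."""
    if training:
        # Training mode: normalize with the statistics of the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Inference mode: normalize with the stored moving statistics.
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Activations that are far from N(0, 1), as is typical inside a deep net.
x = rng.normal(loc=10.0, scale=3.0, size=(256, 4))

# Moving statistics left at their initial values (mean 0, variance 1),
# which is what happens when the update ops never run.
stale_mean, stale_var = np.zeros(4), np.ones(4)

train_out = batch_norm(x, stale_mean, stale_var, training=True)
infer_out = batch_norm(x, stale_mean, stale_var, training=False)

print(abs(train_out.mean()))  # close to 0: batch statistics normalize correctly
print(abs(infer_out.mean()))  # around 10: stale moving statistics barely change x
```

This matches the symptom above: training-mode evaluation looks fine, inference-mode evaluation behaves as if the network were untrained.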
References I found after discovering this:
http://ruishu.io/2016/12/27/batchnorm/
https://github.com/tensorflow/tensorflow/issues/10118
Will update with a solution once I have a tested one.
Answer
So I found a solution, mainly using this reference: http://ruishu.io/2016/12/27/batchnorm/
From the link:
Note: When is_training is True, the moving_mean and moving_variance need to be updated. By default the update_ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    total_loss = control_flow_ops.with_dependencies([updates], total_loss)
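What those update ops do is maintain an exponential moving average of the batch statistics. A minimal NumPy sketch of the update rule (the decay value of 0.99 here is just for illustration; slim's batch_norm uses its own default):

```python
import numpy as np

rng = np.random.default_rng(1)
decay = 0.99                 # illustrative decay; the real layer has its own default
moving_mean = np.zeros(4)    # initial value before any update op has run

# Simulate the per-step update performed by the ops in UPDATE_OPS:
#   moving_mean <- decay * moving_mean + (1 - decay) * batch_mean
for _ in range(1000):
    batch = rng.normal(loc=10.0, scale=3.0, size=(256, 4))
    moving_mean = decay * moving_mean + (1 - decay) * batch.mean(axis=0)

# After enough training steps the moving mean approaches the true
# activation mean (~10), so inference-mode normalization works.
print(moving_mean)
```

If these updates never run, moving_mean stays at its initial zeros, which is exactly the failure mode described above.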
And to be straight to the point: instead of creating the optimizer like this:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(total_loss, global_step=global_step)
do this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(total_loss, global_step=global_step)
That will solve the problem.