Why does the TensorFlow example fail when increasing batch size?


Problem Description


I was looking at the Tensorflow MNIST example for beginners and found that in this part:

# Train for 1000 steps, drawing a fresh mini-batch of 100 examples each step.
for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})


changing the batch size from 100 to anything above 204 causes the model to fail to converge. It works up to 204, but at 205 and every higher number I tried, the accuracy ends up below 10%. Is this a bug, something about the algorithm, or something else?
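Concretely, the failing variant differs from the tutorial loop above only in the argument to next_batch; a minimal sketch of the change:

for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(205)  # batch size raised from 100 to 205; training now diverges
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})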


This is running their binary installation on OS X, which appears to be version 0.5.0.

Recommended Answer


You're using the very basic linear model from the beginners' example?


Here's a trick to debug it: watch the cross-entropy as you increase the batch size (the first line below is from the example; the second is one I just added):

cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# tf.Print is an identity op that logs the listed tensors each time it executes.
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "CrossE")
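As an aside, tf.Print was removed in TensorFlow 2.x; the eager-friendly replacement is tf.print. A minimal self-contained sketch, assuming TF 2.x and using made-up label and prediction tensors purely for illustration:

import tensorflow as tf

# Hypothetical one-hot labels and predicted probabilities, just to exercise the op.
y_ = tf.constant([[0.0, 1.0], [1.0, 0.0]])
y = tf.constant([[0.1, 0.9], [0.8, 0.2]])

cross_entropy = -tf.reduce_sum(y_ * tf.math.log(y))
tf.print("CrossE", cross_entropy)  # prints eagerly; no Session or graph needed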


At a batch size of 204, you'll see:

I tensorflow/core/kernels/logging_ops.cc:64] CrossE[92.37558]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[90.107414]


But at 205, you'll see a sequence like this, from the start:

I tensorflow/core/kernels/logging_ops.cc:64] CrossE[472.02966]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[475.11697]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[1418.6655]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[1546.3833]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[1684.2932]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[1420.02]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[1796.0872]
I tensorflow/core/kernels/logging_ops.cc:64] CrossE[nan]


Ack, NaNs are showing up. Basically, the large batch size is creating such a huge gradient that your model is spiraling out of control: the updates it applies are too large and overshoot the direction it should be moving in by a huge margin.
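It's worth spelling out why batch size matters here: the loss uses tf.reduce_sum, which adds up the per-example losses, so the gradient magnitude grows roughly linearly with the batch size. A common way to decouple the step size from the batch size, which is my addition rather than part of the original answer, is to average over the batch instead (reusing the y_ and y tensors from the example, with the reduction_indices keyword from that era's API; axis in modern TF):

# Average per-example cross-entropies instead of summing them, so the
# gradient scale no longer grows with the batch size.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))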


In practice, there are a few ways to fix this. You could reduce the learning rate from 0.01 to, say, 0.005, which results in a final accuracy of 0.92.

# Same optimizer as before, with the step size halved to 0.005.
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)


Or you could use a more sophisticated optimization algorithm (Adam, Momentum, etc.) that tries to do more to figure out the direction of the gradient. Or you could use a more complex model that has more free parameters across which to disperse that big gradient.
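As a sketch of the first suggestion, assuming a graph-mode TensorFlow version where tf.train.AdamOptimizer and tf.train.MomentumOptimizer are available (the 0.001 learning rate is just Adam's conventional default, not a value from the answer):

# Adam adapts per-parameter step sizes, which copes better with large gradients.
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)
# Or momentum-based gradient descent:
# train_step = tf.train.MomentumOptimizer(0.005, momentum=0.9).minimize(cross_entropy)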

