Does Stochastic Gradient Descent even work with TensorFlow?


Question

I designed a fully connected MLP with two hidden layers and one output layer. I get a nice learning curve if I use batch or mini-batch gradient descent.

But I get a straight line (the violet curve) while performing stochastic gradient descent.

What am I doing wrong?

In my understanding, I am doing stochastic gradient descent with TensorFlow if I provide just one train/learn example per training step, like:

import tensorflow as tf  # TF 1.x API

X = tf.placeholder("float", [None, amountInput], name="Input")           # 10-component input vectors
Y = tf.placeholder("float", [None, amountOutput], name="TeachingInput")  # 20-component label vectors
...
m, i = sess.run([merged, train_op], feed_dict={X: [input], Y: [label]})  # feed a single example per step

Here input is a 10-component vector and label is a 20-component vector.

For testing I run 1000 iterations; each iteration feeds one of 50 prepared train/learn examples. I expected an overfitted network, but as you can see, it doesn't learn :(
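
A minimal sketch of that loop, assuming the 50 prepared examples live in two hypothetical Python lists train_examples and train_labels (X, Y, merged, train_op and sess are the same objects as in the snippet above):

import random

for step in range(1000):
    idx = random.randrange(len(train_examples))   # pick one of the 50 prepared examples
    x_single = [train_examples[idx]]              # batch of size 1, shape (1, 10)
    y_single = [train_labels[idx]]                # batch of size 1, shape (1, 20)
    summary, _ = sess.run([merged, train_op],
                          feed_dict={X: x_single, Y: y_single})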

Because the network will run in an online-learning environment, mini-batch or batch gradient descent isn't an option.

Thanks for any hint.

Answer

The batch size influences the effective learning rate.

If you look at the update formula for a single parameter, you'll see that it is updated by averaging the values computed for that parameter over every element in the input batch.
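
Written out, the averaged mini-batch update for a single parameter θ with learning rate η and batch size n is roughly

θ ← θ − (η / n) · Σ_{i=1..n} ∇L_i(θ)

so with a batch of one example the gradient is not averaged down, and each update hits the parameter at full strength.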

This means that if you're working with a batch size of n, your "real" per-parameter learning rate is about learning_rate/n.

Thus, if the model you trained with batches of size n trained without issues, it is because the learning rate was appropriate for that batch size.

If you use pure stochastic gradient descent, you have to lower the learning rate (usually by a factor of some power of 10).

So, for example, if your learning rate was 1e-4 with a batch size of 128, try a learning rate of 1e-4 / 128.0 and see whether the network learns (it should).
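
As a rough sketch of that adjustment (base_learning_rate, batch_size and cost are hypothetical names; the optimizer calls are the standard TF 1.x API):

import tensorflow as tf

base_learning_rate = 1e-4                              # rate that worked with batches of 128
batch_size = 128
sgd_learning_rate = base_learning_rate / batch_size    # ~7.8e-7 for single-example updates

optimizer = tf.train.GradientDescentOptimizer(sgd_learning_rate)
train_op = optimizer.minimize(cost)                    # cost = whatever loss the MLP already minimizes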

