Batch size for Stochastic gradient descent is length of training data and not 1?


Problem description

I am trying to plot the different learning outcome when using Batch gradient descent, Stochastic gradient descent and mini-batch stochastic gradient descent.

Everywhere I look, I read that batch_size=1 is the same as plain SGD and batch_size=len(train_data) is the same as batch gradient descent.

I know that stochastic gradient descent uses only a single data sample for every update, while batch gradient descent uses the entire training data set to compute the gradient of the objective function for each update.
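(For concreteness, here is a minimal NumPy sketch, not from the question, contrasting one SGD update with one batch-GD update for a linear model trained with mean squared error; the data shapes and learning rate are arbitrary illustrative choices.)

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        #toy training inputs
y = X @ np.array([1.0, -2.0, 0.5])    #toy targets
w = np.zeros(3)                       #model weights
lr = 0.01                             #learning rate

#Stochastic gradient descent: gradient of the squared error of a single sample
i = rng.integers(len(X))
grad_sgd = 2 * (X[i] @ w - y[i]) * X[i]
w_sgd = w - lr * grad_sgd

#Batch gradient descent: gradient of the mean squared error over the whole training set
grad_gd = 2 * X.T @ (X @ w - y) / len(X)
w_gd = w - lr * grad_gd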

However, when setting batch_size in Keras, the opposite seems to be happening. Take my code for example, where I have set batch_size equal to the length of my training data:

import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa

#train_dataset, normed_train_data and train_labels are assumed to be defined earlier
input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250

weights_initializer = keras.initializers.GlorotUniform()

#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
            #Layer to be used as an entry point into a Network
            keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
            #Dense layer 1
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_1'),
            #Dense layer 2
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_2'),
            #activation function is linear since we are doing regression
            keras.layers.Dense(output_size, activation='linear', name='Output_layer')
                                ])
    
    #Use the stochastic gradient descent optimizer, but change batch_size to get batch GD, SGD or mini-batch SGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)
    
    #Compiling the model
    model.compile(optimizer=optimizer, 
                  loss='mean_squared_error', #Computes the mean of squares of errors between labels and predictions
                  metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred
    
    # initialize TimeStopping callback 
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
    
    #Training the network
    history = model.fit(normed_train_data, train_labels, 
         epochs=n_epochs,
         batch_size=hparams['batch_size'], 
         verbose=1,
         #validation_split=0.2,
         callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
    
    return history

train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
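
(For reference, a minimal sketch of how the same function could be called to compare the three regimes, assuming the train_val_model and normed_train_data defined above; the mini-batch size of 32 is an arbitrary illustrative value, not something from the question.)

#Illustrative only: one call per gradient-descent regime
history_gd = train_val_model("logs/batch_gd", {'batch_size': len(normed_train_data)})  #batch GD: one update per epoch
history_mb = train_val_model("logs/mini_batch", {'batch_size': 32})                    #mini-batch GD: ceil(N/32) updates per epoch
history_sgd = train_val_model("logs/sgd", {'batch_size': 1})                           #SGD: N updates per epoch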

When running this, the output seems to show a single update per epoch, i.e. SGD:

As can be seen, underneath every epoch it says 1/1, which I assume means a single update iteration. If, on the other hand, I set batch_size=1, I get 90000/90000, which is the size of my entire data set (training-time wise this also makes sense).

So, my question is: is batch_size=1 actually batch gradient descent and not stochastic gradient descent, and is batch_size=len(train_data) actually stochastic gradient descent and not batch gradient descent?

Recommended answer

There are actually three (3) cases:

  • batch_size = 1 means indeed stochastic gradient descent (SGD)
  • A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
  • Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent

See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".

These definitions are in fact fully compliant with what you report from your experiments:

  • With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.

  • In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.

i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).
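
As a quick check of that arithmetic (90000 is the sample count from the question; 32 is just an arbitrary mini-batch example):

import math

n_samples = 90000
for batch_size in (1, 32, n_samples):
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size}: {updates_per_epoch} updates per epoch")
#batch_size=1     -> 90000 updates per epoch (the 90000/90000 readout)
#batch_size=32    -> 2813 updates per epoch
#batch_size=90000 -> 1 update per epoch (the 1/1 readout)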
