How to update the learning rate in a two layered multi-layered perceptron?


Problem description

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

A simple

  • two layered Multi-Layered Perceptron (MLP) with
  • sigmoid activations between them and
  • Mean Square Error (MSE) as the loss function/optimization criterion

[code]:

def sigmoid(x): # Logistic sigmoid; maps values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx): # For backpropagation.
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))
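
As a quick shape check (an illustrative aside, using the arrays defined above): W1 maps the 2 input features to the 5 hidden units and W2 maps the 5 hidden units to the single output, so a full forward pass over the four XOR rows yields a (4, 1) matrix of predictions.

# Illustrative shape check: X is (4, 2), W1 is (2, 5), W2 is (5, 1).
print(X.shape, W1.shape, W2.shape)                        # (4, 2) (2, 5) (5, 1)
print(sigmoid(np.dot(sigmoid(np.dot(X, W1)), W2)).shape)  # (4, 1): one prediction per sample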

And given the stopping criteria as a fixed no. of epochs (no. of iterations through the X and Y) with a fixed learning rate of 0.3:

# Number of epochs and learning rate.
num_epochs = 10000
learning_rate = 0.3

When I run through the forward-backward propagation and update the weights in each epoch, how should I update the weights?

I tried to simply add the product of the learning rate with the dot product of the backpropagated derivative with the layer outputs but the model still only updated the weights in one direction causing all the weights to degrade to near zero.

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())
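
One way to see why everything drifts in a single direction (an illustrative check using the definitions above, outside the training loop): mse returns a single non-negative scalar and the sigmoid outputs lie in (0, 1), so every entry of layer2_error * sigmoid_derivative(layer2) is non-negative and the W2 update can never change sign.

# Illustrative check: with a scalar, non-negative "error", the delta carries no sign
# information, so the gradient step for W2 always points the same way.
l1 = sigmoid(np.dot(X, W1))
l2 = sigmoid(np.dot(l1, W2))
scalar_error = mse(l2, Y)                       # a single value >= 0
delta = scalar_error * sigmoid_derivative(l2)   # every entry >= 0
print((np.dot(l1.T, delta) >= 0).all())         # True -> W2 only ever decreases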

How should the weights be updated correctly?

Full code:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Logistic sigmoid; maps values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Number of epochs and learning rate.
num_epochs = 10000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())

# Visualize the losses
plt.plot(losses)
plt.show()

Am I missing anything in the back-propagation?

Maybe I missed out the derivative from the cost to the second layer?

I realized I missed the partial derivative from the cost to the second layer and after adding it:

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

def mse_derivative(predicted, truth):
    return predicted - truth
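
A quick finite-difference sanity check of this derivative (illustrative, using the mse and mse_derivative defined above together with the XOR targets Y):

# Numerically verify that mse_derivative matches the gradient of mse.
pred = np.array([[0.2], [0.7], [0.9], [0.4]])   # made-up predictions
eps = 1e-6
numeric = np.zeros_like(pred)
for i in range(pred.shape[0]):
    bumped = pred.copy()
    bumped[i, 0] += eps
    numeric[i, 0] = (mse(bumped, Y) - mse(pred, Y)) / eps
print(np.allclose(numeric, mse_derivative(pred, Y), atol=1e-4))  # True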

With the updated backpropagation loop across epochs:

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = layer2_error *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

It seemed to train and learn the XOR...
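
A quick way to confirm this (illustrative, assuming the updated loop above has just run) is to compare the rounded network outputs against the XOR targets:

# After training, the rounded predictions should reproduce the XOR truth table.
hidden = sigmoid(np.dot(X, W1))
preds = sigmoid(np.dot(hidden, W2))
print(np.round(preds).ravel())   # expected: [0. 1. 1. 0.]
print(Y.ravel())                 # [0 1 1 0]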

But now the question arises: are layer2_error and layer2_delta computed correctly, i.e. is the following part of the code correct?

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = np.dot(cost_delta, cost_error)
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

Is it correct to do a dot product on the cost_delta and cost_error for the layer2_error? Or would layer2_error just be equal to cost_delta?

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = cost_delta
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

Recommended answer

Yes, it is correct to multiply the residuals (cost_error) with the delta values when we update the weights.

However, it doesn't really matter whether you do the dot product or not, since cost_error is a scalar. So a simple multiplication is enough. But we definitely have to multiply by the gradient of the cost function, because that's where we start our backprop (i.e. it's the entry point for the backward pass).
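
To make the scalar point concrete, a small illustrative check (the numbers are made up):

# Because cost_error is a scalar, np.dot degenerates to a plain multiplication.
cost_delta = np.array([[0.2], [-0.3], [-0.1], [0.4]])   # e.g. predicted - truth
cost_error = 0.15                                       # e.g. a scalar loss value
print(np.allclose(np.dot(cost_delta, cost_error), cost_delta * cost_error))  # True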

Also, the below function can be simplified:

# Original:
def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

# Simplified:
def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))
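
For reference, a small illustrative comparison of the two versions on made-up predictions: since np.sum already returns a scalar, the trailing .mean() in the first version is a no-op, so the two versions differ only by the constant factor num_data, which merely rescales the gradient and can be absorbed into the learning rate.

# Illustrative comparison: the first version is 0.5 * sum(...), the second 0.5 * mean(...).
pred = np.array([[0.2], [0.7], [0.9], [0.4]])
v_sum = 0.5 * np.sum(np.square(pred - Y)).mean()
v_mean = 0.5 * np.mean(np.square(pred - Y))
print(v_sum, v_mean, v_sum / v_mean)   # 0.15 0.0375 4.0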
