Checking the gradient when doing gradient descent


Problem description


I'm trying to implement a feed-forward backpropagating autoencoder (training with gradient descent) and wanted to verify that I'm calculating the gradient correctly. This tutorial suggests computing the derivative of each parameter one at a time: grad_i(theta) = (J(theta_i+epsilon) - J(theta_i-epsilon)) / (2*epsilon). I've written a sample piece of Matlab code to do just this, but without much luck: the differences between the gradient calculated analytically and the gradient estimated numerically tend to be large (much more than disagreement in the 4th significant figure).
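(As a sanity check of the formula itself, here is a toy version of the central-difference check on a known scalar function; the function f and the numbers here are purely illustrative, not part of my network:)

% toy central-difference check: f(t) = t.^2 has exact derivative 2*t
f = @(t) t.^2;          % illustrative function only
t0 = 3;
epsilon = 1e-5;
numgrad = (f(t0 + epsilon) - f(t0 - epsilon)) / (2*epsilon);
fprintf('numerical: %f, exact: %f\n', numgrad, 2*t0); % both print 6.000000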

If anyone can offer any suggestions, I would greatly appreciate the help (either with my calculation of the gradient or with how I perform the check). Because I've simplified the code greatly to make it more readable, I haven't included any biases, and am no longer tying the weight matrices.

First, I initialize the variables:

numHidden = 200;
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
encoder = low + (high-low)*rand(numVisible, numHidden);
decoder = low + (high-low)*rand(numHidden, numVisible);
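(This is the usual random-initialization heuristic for sigmoid units, a uniform range of ±4*sqrt(6/(fan_in + fan_out)); with numHidden = 200 and numVisible = 784 it works out to roughly ±0.31.)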

Next, given some input image x, do feed-forward propagation:

a = sigmoid(x*encoder);
z = sigmoid(a*decoder); % (reconstruction of x)
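(Throughout, sigmoid is the elementwise logistic function; if you don't already have it as a function file, a one-line anonymous-function version would be:)

sigmoid = @(h) 1 ./ (1 + exp(-h)); % elementwise logistic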

The loss function I'm using is the standard squared error, Σ 0.5*(z - x)^2:

% first calculate the error by finding the derivative of sum(0.5*(z-x).^2),
% which is (f(h)-x).*f'(h), where z = f(h), h = a*decoder, and
% f is the sigmoid function. Since the derivative of the sigmoid is
% sigmoid.*(1 - sigmoid), we get:
error_0 = (z - x).*z.*(1-z);

% The gradient \Delta w_{ji} = error_j*a_i
gDecoder = error_0'*a;

% not important, but included for completeness
% do back-propagation one layer down
error_1 = (error_0*encoder).*a.*(1-a);
gEncoder = error_1'*x;
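Spelled out per weight, using the same w_{ji} convention as in the comment above (the weight from hidden unit i to output unit j, so h_j = Σ_i a_i w_{ji} and z_j = f(h_j)):

\frac{\partial J}{\partial w_{ji}}
  = \frac{\partial J}{\partial z_j}\,\frac{\partial z_j}{\partial h_j}\,\frac{\partial h_j}{\partial w_{ji}}
  = (z_j - x_j)\, z_j (1 - z_j)\, a_i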

And finally, check that the gradient is correct (in this case, just do it for the decoder):

epsilon = 10e-5;
check = gDecoder(:); % the values we obtained above
for i = 1:size(decoder(:), 1)
    % calculate J+
    theta = decoder(:); % unroll
    theta(i) = theta(i) + epsilon;
    decoderp = reshape(theta, size(decoder)); % re-roll
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jp = sum(0.5*(z - x).^2);

    % calculate J-
    theta = decoder(:);
    theta(i) = theta(i) - epsilon;
    decoderp = reshape(theta, size(decoder));
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jm = sum(0.5*(z - x).^2);

    grad_i = (Jp - Jm) / (2*epsilon);
    diff = abs(grad_i - check(i));
    fprintf('%d: %f <=> %f: %f\n', i, grad_i, check(i), diff);
end

Running this on the MNIST dataset (for the first entry) gives results such as:

2: 0.093885 <=> 0.028398: 0.065487
3: 0.066285 <=> 0.031096: 0.035189
5: 0.053074 <=> 0.019839: 0.033235
6: 0.108249 <=> 0.042407: 0.065843
7: 0.091576 <=> 0.009014: 0.082562
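(As an aside, a common way to summarize such a check with a single scale-independent number, rather than eyeballing per-element diffs, is the relative error between the two gradient vectors. This assumes the grad_i values from the loop are collected into a vector numgrad, which my code above doesn't do:)

% hypothetical: inside the loop above, also store numgrad(i) = grad_i;
relerr = norm(numgrad - check) / norm(numgrad + check);
fprintf('relative error: %g\n', relerr); % tiny (e.g. ~1e-9) means agreement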

Solution

Do not apply the sigmoid to both a and z. Use it only on z:

a = x*encoder;
z = sigmoid(a*decoderp);
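(For the check to be meaningful, the same change presumably has to be made everywhere a is computed: in the main forward pass that produces error_0 and gDecoder, and in both forward passes inside the check loop, so that the analytic and numerical gradients describe the same network.)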
