Machine learning - Linear regression using batch gradient descent


Problem description

I am trying to implement batch gradient descent on a data set with a single feature and multiple training examples (m).

When I use the normal equation I get the right answer, but the MATLAB code below, which performs batch gradient descent, gives the wrong one.

 function [theta] = gradientDescent(X, y, theta, alpha, iterations)
      m = length(y);
      delta=zeros(2,1);
      for iter =1:1:iterations
          for i=1:1:m
              delta(1,1)= delta(1,1)+( X(i,:)*theta - y(i,1))  ;
              delta(2,1)=delta(2,1)+ (( X(i,:)*theta - y(i,1))*X(i,2)) ;
          end
          theta= theta-( delta*(alpha/m) );
        computeCost(X,y,theta)
      end
end

y is the vector of target values, and X is a matrix whose first column is all ones and whose second column holds the feature values (the variable).

I have implemented the update using vectorization, i.e.

theta = theta - (alpha/m)*delta

... where delta is a 2-element column vector initialized to zeros.

The cost function is J(theta) = 1/(2m) * (sum from i=1 to m [(h(theta) - y)^2]).

Recommended answer

The error is very simple. Your delta declaration should be inside the first for loop. Each time you accumulate the weighted differences between the predictions for the training samples and the target outputs, you should start accumulating from zero.

By not doing this, you are carrying over the accumulated errors from the previous iterations, which were computed with earlier versions of theta, and that isn't correct. You must reset delta at the beginning of the first for loop.
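To see the effect, the sketch below runs both variants side by side. It is only an illustration under assumed values: the data, alpha, and the iteration count are made up, and the inner loop over samples is written as X'*(X*theta - y), which accumulates the same two sums.

```matlab
% Illustrative data on the line y = 1 + 2x (not from the question).
x = (0:0.5:4)';
y = 1 + 2*x;
X = [ones(length(x),1), x];
m = length(y);
alpha = 0.1;         % assumed learning rate
iterations = 2000;   % assumed iteration count
cost = @(theta) sum((X*theta - y).^2) / (2*m);

% Variant 1: delta is created once and keeps accumulating (the bug).
theta_bug = zeros(2,1);
delta = zeros(2,1);
for iter = 1:iterations
    delta = delta + X'*(X*theta_bug - y);
    theta_bug = theta_bug - delta*(alpha/m);
end

% Variant 2: delta starts from zero in every iteration (the fix).
theta_fix = zeros(2,1);
for iter = 1:iterations
    delta = zeros(2,1);
    delta = delta + X'*(X*theta_fix - y);
    theta_fix = theta_fix - delta*(alpha/m);
end

cost_bug = cost(theta_bug);
cost_fix = cost(theta_fix);
```

On this data the variant that resets delta ends with a cost near zero, while the accumulating variant ends with a larger cost.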

In addition, you seem to have an extraneous computeCost call whose result is never stored. I'm assuming it calculates the cost function at every iteration given the current parameters, so I'm going to create a new output array called costs that holds this value for each iteration, and assign the result of the call to the corresponding element of that array:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations,1); %// New: one cost value per iteration
%    delta=zeros(2,1); %// Remove
    for iter = 1:1:iterations
        delta = zeros(2,1); %// Place here
        for i = 1:1:m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        costs(iter) = computeCost(X,y,theta); %// New
    end
end
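
The question does not show computeCost itself. A minimal sketch of what it might look like, assuming it implements the cost J(theta) = 1/(2m) * sum((h - y)^2) stated in the question; the anonymous-function form and the toy data are only for illustration:

```matlab
% Hypothetical stand-in for the asker's computeCost (not shown in the question).
computeCost = @(X, y, theta) sum((X*theta - y).^2) / (2*length(y));

% Toy data lying exactly on y = 1 + 2x (illustrative values only).
X = [ones(3,1), [1; 2; 3]];
y = [3; 5; 7];

J_exact = computeCost(X, y, [1; 2]);   % perfect fit, so every residual is zero
J_zero  = computeCost(X, y, [0; 0]);   % residuals are -y, so J = (9 + 25 + 49) / 6
```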


A note on proper vectorization

FWIW, I don't consider this implementation fully vectorized. You can eliminate the second for loop by using vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent here for linear regression. We want the parameters theta, our linear regression coefficients, that minimize this cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right)^2$$

m is the number of training samples we have available, x^{i} is the i-th training example, and y^{i} is the ground truth value associated with the i-th training sample. h is our hypothesis, and it is given as:

$$h_\theta(x^{i}) = \theta^{T} x^{i} = \theta_0 + \theta_1 x_1^{i} + \cdots + \theta_n x_n^{i}$$

Note that for linear regression in 2D, theta has only two values we want to compute: the intercept term and the slope (theta(1) and theta(2) in the MATLAB code).

We can minimize the cost function J to determine the regression coefficients that give the predictions with the smallest error on the training set. Specifically, we start with some initial theta parameters, usually a vector of zeros, run as many iterations as we see fit, and at each iteration update every theta parameter by this relationship:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For each parameter we want to update, we need the gradient of the cost function with respect to that parameter, evaluated at the current state of theta. If you work this out using calculus, you get:

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right)$$

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right) x_1^{i}$$

If you're unclear about how this derivation works, I refer you to this nice Mathematics Stack Exchange post that talks about it:

> https://math.stackexchange.com/questions/70728/
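
One way to sanity-check the partial derivatives is to compare them against finite differences. A minimal sketch, assuming the cost J(theta) from the question and the gradient (1/m) * sum((h - y) .* x_j); the data, the starting theta, and the step size h are illustrative choices:

```matlab
% Assumed cost, matching J(theta) from the question.
J = @(X, y, theta) sum((X*theta - y).^2) / (2*length(y));

X = [ones(4,1), [1; 2; 3; 4]];   % illustrative data
y = [2; 4; 5; 8];
theta = [0.5; 1.5];
m = length(y);

% Analytic partial derivatives: (1/m) * sum((h - y) .* x_j) for each column j.
err = X*theta - y;
analytic = [sum(err .* X(:,1)); sum(err .* X(:,2))] / m;

% Central finite differences as an independent check.
h = 1e-6;
numeric = zeros(2,1);
for j = 1:2
    step = zeros(2,1);
    step(j) = h;
    numeric(j) = (J(X, y, theta + step) - J(X, y, theta - step)) / (2*h);
end

max_diff = max(abs(analytic - numeric));
```

If the two disagree by more than rounding error, the analytic gradient (or its implementation) is the first thing to inspect.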

Now... how can we apply this to our current problem? Specifically, you can calculate the entries of delta quite easily by analyzing all of the samples together in one go. What I mean is that you can just do this:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations,1); % one cost value per iteration
    for iter = 1 : iterations
        % Both updates use the old theta; each sum runs over all m samples.
        delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
        delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));

        theta = [delta1; delta2];
        costs(iter) = computeCost(X,y,theta);
    end
end

The operations for delta1 and delta2 are each fully vectorized in a single statement. What you are doing is evaluating theta^{T}*X^{i} for every sample i from 1, 2, ..., m, which you can conveniently place into a single sum statement.
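
As a quick check that nothing changes numerically, the sketch below computes the two accumulated sums once with the per-sample loop and once with sum. The data and theta are illustrative values, not taken from the question:

```matlab
X = [ones(4,1), [1; 2; 3; 4]];   % illustrative data
y = [2; 4; 5; 8];
theta = [0.5; 1.5];
m = length(y);

% Loop form: accumulate each sample's contribution, starting from zero.
delta_loop = zeros(2,1);
for i = 1:m
    err_i = X(i,:)*theta - y(i);
    delta_loop(1) = delta_loop(1) + err_i * X(i,1);
    delta_loop(2) = delta_loop(2) + err_i * X(i,2);
end

% Sum form: the same accumulation over all samples at once.
err = theta(1)*X(:,1) + theta(2)*X(:,2) - y;
delta_sum = [sum(err .* X(:,1)); sum(err .* X(:,2))];

gap = max(abs(delta_loop - delta_sum));
```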

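We can go even further and replace this with purely matrix operations. First off, you can compute theta^{T}*X^{i} for each input sample X^{i} very quickly using matrix multiplication. Suppose that:

$$X = \begin{bmatrix} 1 & x_1^{1} & \cdots & x_n^{1} \\ 1 & x_1^{2} & \cdots & x_n^{2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{m} & \cdots & x_n^{m} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$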

Here, X is our data matrix, which consists of m rows corresponding to the m training samples and n + 1 columns: a leading column of ones for the intercept term followed by the n features. Similarly, theta is the weight vector learned by gradient descent, with n + 1 entries to account for the intercept term.

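If we compute X*theta, we get:

$$X\theta = \begin{bmatrix} \theta^{T} x^{1} \\ \theta^{T} x^{2} \\ \vdots \\ \theta^{T} x^{m} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{1}) \\ h_\theta(x^{2}) \\ \vdots \\ h_\theta(x^{m}) \end{bmatrix}$$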

As you can see here, we have computed the hypothesis for each sample and placed the results into a vector. The i-th element of this vector is the hypothesis for the i-th training sample. Now, recall what the gradient term of each parameter is in gradient descent:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right) x_j^{i}, \qquad x_0^{i} = 1$$

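We want to compute this in one go for all of the parameters in the learned vector, and so putting this into a vector gives us:

$$\nabla_\theta J(\theta) = \frac{1}{m} \begin{bmatrix} \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right) x_0^{i} \\ \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right) x_1^{i} \\ \vdots \\ \sum_{i=1}^{m} \left( h_\theta(x^{i}) - y^{i} \right) x_n^{i} \end{bmatrix}$$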

Finally:

$$\nabla_\theta J(\theta) = \frac{1}{m} X^{T} \left( X\theta - y \right)$$

Since y is already a vector of length m, we can compute the gradient descent update at each iteration very compactly by:

theta = theta - (alpha/m)*X'*(X*theta - y);

... so your code is now just:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); % one cost value per iteration
    for iter = 1 : iterations
        theta = theta - (alpha/m)*X'*(X*theta - y);
        costs(iter) = computeCost(X,y,theta);
    end
end
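
Since the question compares against the normal equation, the sketch below runs the same matrix update inline and checks the result against the normal-equation solution. Everything concrete here is an assumption for illustration: the noise-free data, alpha, the iteration count, and the inline cost expression standing in for computeCost.

```matlab
% Illustrative data: noise-free points on y = 1 + 2x.
x = (0:0.5:4)';
y = 1 + 2*x;
X = [ones(length(x),1), x];
m = length(y);

alpha = 0.1;          % assumed learning rate
iterations = 5000;    % assumed iteration count
theta = zeros(2,1);
costs = zeros(iterations,1);
for iter = 1:iterations
    theta = theta - (alpha/m) * X' * (X*theta - y);
    costs(iter) = sum((X*theta - y).^2) / (2*m);   % stand-in for computeCost
end

% Normal equation solution for comparison.
theta_ne = (X'*X) \ (X'*y);
gap = max(abs(theta - theta_ne));
```

A learning rate that is too large for the data makes the costs grow instead of shrink, so plotting or inspecting costs is a cheap way to confirm that alpha is reasonable.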
