Machine learning - Linear regression using batch gradient descent


Problem description

I am trying to implement batch gradient descent on a data set with a single feature and multiple training examples (m).

When I use the normal equation, I get the right answer, but the code below, which performs batch gradient descent in MATLAB, gives the wrong one.

function [theta] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    delta = zeros(2,1);
    for iter = 1:1:iterations
        for i = 1:1:m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        computeCost(X,y,theta)
    end
end

y is the vector of target values, and X is a matrix whose first column is all ones and whose second column holds the values of the single feature (the variable).

I have implemented this using vectorization, i.e.

theta = theta - (alpha/m)*delta

... where delta is a 2 element column vector initialized to zeroes.

The cost function J(Theta) is 1/(2m)*(sum from i=1 to m [(h(theta)-y)^2]).
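
The question does not show computeCost itself. For reference, here is a minimal sketch of what it might look like, assuming it implements exactly the formula above with h(theta) = X*theta. The anonymous function and the three-point toy data set are illustrative assumptions, not the asker's actual helper:

```matlab
% Hypothetical stand-in for computeCost (the real helper is not shown):
% J(theta) = 1/(2m) * sum((X*theta - y).^2)
computeCost = @(X, y, theta) sum((X*theta - y).^2) / (2*length(y));

% Toy data (assumed): first column of ones, second column the feature.
X = [1 1; 1 2; 1 3];
y = [2; 4; 6];

J_zero = computeCost(X, y, [0; 0]);   % cost at theta = 0
J_fit  = computeCost(X, y, [0; 2]);   % cost for the exact fit y = 2*x

fprintf('J at zero: %g, J at exact fit: %g\n', J_zero, J_fit);
```

Because every prediction is produced by the single product X*theta, the same function works unchanged for any number of features.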

Solution

The error is very simple. Your delta declaration should be inside the first for loop. Every time you accumulate the weighted differences between the hypothesis and the target over the training samples, you should start accumulating from zero.

If you don't, delta still carries the sums from earlier iterations, which were computed with earlier versions of theta, so the update no longer uses the gradient at the current theta. You must reset delta at the beginning of the first for loop.
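
To see the effect, here is a small experiment that runs both variants side by side and compares them with the normal equation. The toy data (drawn from y = 1 + 2x), the learning rate and the iteration count are assumptions chosen for illustration; they are not from the question:

```matlab
% Toy data from y = 1 + 2*x (assumed for illustration only).
x = (1:4)';
X = [ones(4,1), x];
y = 1 + 2*x;
m = length(y);
alpha = 0.1;          % assumed learning rate
iterations = 2000;    % assumed iteration count

theta_exact = (X'*X) \ (X'*y);    % normal equation, for reference

% Variant 1: delta is reset at the start of every iteration.
theta_fixed = zeros(2,1);
for iter = 1:iterations
    delta = zeros(2,1);
    for i = 1:m
        err = X(i,:)*theta_fixed - y(i);
        delta = delta + err * X(i,:)';
    end
    theta_fixed = theta_fixed - (alpha/m)*delta;
end

% Variant 2: delta is initialised once, as in the question.
theta_buggy = zeros(2,1);
delta = zeros(2,1);
for iter = 1:iterations
    for i = 1:m
        err = X(i,:)*theta_buggy - y(i);
        delta = delta + err * X(i,:)';
    end
    theta_buggy = theta_buggy - (alpha/m)*delta;
end

% Columns: normal equation, delta reset, delta never reset.
disp([theta_exact, theta_fixed, theta_buggy]);
```

With the reset, theta settles on the normal-equation solution. Without it, each step subtracts the running sum of all past gradients, so on this data theta keeps oscillating around the optimum instead of converging.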

In addition, you have a bare computeCost call whose result is never stored. I'm assuming it calculates the cost function given the current parameters, so I'm going to add a new output array called costs that records this at each iteration, and assign the result of the call to the corresponding element of that array:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations,1); % New: one cost per iteration
    % delta = zeros(2,1); % Remove from here
    for iter = 1:1:iterations
        delta = zeros(2,1); % Place here: reset the accumulator every iteration
        for i = 1:1:m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        costs(iter) = computeCost(X,y,theta); % New
    end
end

A note on proper vectorization

FWIW, I don't consider this implementation completely vectorized. You can eliminate the second for loop by using vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent to fit a linear regression. We seek the parameters theta, our linear regression coefficients, that minimize this cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

m is the number of training samples we have available and x^{(i)} is the ith training sample. y^{(i)} is the ground truth value associated with the ith training sample. h is our hypothesis, and it is given as:

$$h_\theta(x) = \theta^{T} x = \theta_0 + \theta_1 x_1$$

Note that for linear regression in 2D, theta holds only two values we want to compute: the intercept term and the slope.

We minimize the cost function J to find the regression coefficients that give the smallest error on the training set. Specifically, starting with some initial theta, usually a vector of zeroes, we run as many iterations as we see fit, and at each iteration we update every parameter simultaneously by this relationship:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For each parameter we want to update, we need the gradient of the cost function with respect to that parameter, evaluated at the current theta. If you work this out using calculus, we get:

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}$$

If you're unclear on how this derivation works, I refer you to this nice Mathematics Stack Exchange post that talks about it:

https://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables
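
For completeness, here is a short sketch of that derivation, assuming the hypothesis is linear, h_theta(x) = theta^T x, with x_0 = 1 for the intercept. It only applies the chain rule to the squared-error cost:

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2 \\
\frac{\partial J}{\partial \theta_j}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \theta^{T} x^{(i)} - y^{(i)} \right)
     \frac{\partial}{\partial \theta_j} \left( \theta^{T} x^{(i)} - y^{(i)} \right) \\
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
```

The factor of 2 from the square cancels the 1/2 in the cost, which is why the cost is defined with 1/(2m) rather than 1/m. For j = 0 the factor x_0^{(i)} is 1, so the intercept gradient is just the mean error.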

Now... how can we apply this to our current problem? Specifically, you can calculate the entries of delta quite easily by analyzing all of the samples together in one go. What I mean is that you can just do this:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations,1); % one cost per iteration
    for iter = 1 : iterations
        % delta1 and delta2 hold the updated values of theta(1) and theta(2)
        delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
        delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));

        theta = [delta1; delta2]; % simultaneous update of both parameters
        costs(iter) = computeCost(X,y,theta);
    end
end

The operations on delta(1) and delta(2) can each be completely vectorized in a single statement. What you are doing is computing theta^{T}*x^{(i)} for each sample i = 1, 2, ..., m, and you can conveniently place this into a single sum statement.

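We can go even further and replace this with purely matrix operations. First off, you can compute theta^{T}*x^{(i)} for every input sample x^{(i)} very quickly using matrix multiplication. Suppose:

$$X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$

Here, X is our data matrix, made up of m rows corresponding to the m training samples and n+1 columns: a leading column of ones (x_0 = 1) for the intercept, followed by the n features. Similarly, theta is our learned weight vector from gradient descent, with n+1 entries accounting for the intercept term.

If we compute X*theta, we get:

$$X\theta = \begin{bmatrix} \theta^{T} x^{(1)} \\ \theta^{T} x^{(2)} \\ \vdots \\ \theta^{T} x^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{bmatrix}$$

As you can see here, we have computed the hypothesis for each sample and placed each into a vector. The ith element of this vector is the hypothesis for the ith training sample. Now, recall what the gradient term of each parameter is in gradient descent:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

We want to implement this all in one go for all of the parameters in your learned vector, and so putting this into a vector gives us:

$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

Finally:

$$\nabla_\theta J(\theta) = \frac{1}{m} X^{T} \left( X\theta - y \right)$$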

Since y is already a vector of length m, we can compute each gradient descent update very compactly as:

theta = theta - (alpha/m)*X'*(X*theta - y);
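
As a sanity check on that one-liner, the sketch below computes the accumulated term three ways (the per-sample loop, the column-wise sums, and the matrix product) and confirms they agree. The toy data and the arbitrary theta are assumptions for illustration:

```matlab
% Toy data and an arbitrary parameter vector (both assumed).
X = [1 1; 1 2; 1 3; 1 4];
y = [3; 5; 7; 9];
theta = [0.5; -1];
m = length(y);

% 1) Per-sample loop, as in the original code.
delta_loop = zeros(2,1);
for i = 1:m
    delta_loop = delta_loop + (X(i,:)*theta - y(i)) * X(i,:)';
end

% 2) One sum per parameter.
err = X*theta - y;                 % hypothesis minus target, all samples
delta_sum = [sum(err .* X(:,1)); sum(err .* X(:,2))];

% 3) Pure matrix form.
delta_mat = X' * err;

% All three columns should match.
disp([delta_loop, delta_sum, delta_mat]);
```

The matrix form also generalizes to any number of features without code changes, whereas the first two hard-code the two parameters.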

.... so your code is now just:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); % one cost per iteration
    for iter = 1 : iterations
        theta = theta - (alpha/m)*X'*(X*theta - y);
        costs(iter) = computeCost(X,y,theta);
    end
end
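
To try this end to end, here is a self-contained script version of the same loop. It is only a sketch: the anonymous computeCost stands in for the helper that the question does not show, and the toy data, learning rate and iteration count are assumptions:

```matlab
% Hypothetical stand-in for the asker's computeCost helper.
computeCost = @(X, y, theta) sum((X*theta - y).^2) / (2*length(y));

% Toy data from y = 1 + 2*x (assumed).
x = (1:4)';
X = [ones(4,1), x];
y = 1 + 2*x;

m = length(y);
alpha = 0.1;                     % assumed learning rate
iterations = 2000;               % assumed iteration count
theta = zeros(2,1);
costs = zeros(iterations, 1);    % one entry per iteration

for iter = 1:iterations
    theta = theta - (alpha/m)*X'*(X*theta - y);
    costs(iter) = computeCost(X, y, theta);
end

theta_exact = X \ y;             % least-squares solution for comparison
disp([theta, theta_exact]);
```

If gradient descent is working, costs should decrease from one iteration to the next and theta should approach the least-squares solution. If costs grows instead, the learning rate is too large for the data.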
