Gradient descent and normal equation method for solving linear regression give different solutions

Problem description

I'm working on a machine learning problem and want to use linear regression as the learning algorithm. I have implemented two different methods to find the parameters theta of the linear regression model: gradient (steepest) descent and the normal equation. On the same data they should both give approximately equal theta vectors. However, they do not.

Both theta vectors are very similar on all elements but the first one. That is the one that multiplies the column of all ones added to the data (the intercept term).

Here is how the thetas look (first column is the output of gradient descent, second the output of the normal equation):

Grad desc Norm eq
-237.7752 -4.6736
-5.8471   -5.8467
9.9174    9.9178
2.1135    2.1134
-1.5001   -1.5003
-37.8558  -37.8505
-1.1024   -1.1116
-19.2969  -19.2956
66.6423   66.6447
297.3666  296.7604
-741.9281 -744.1541
296.4649  296.3494
146.0304  144.4158
-2.9978   -2.9976
-0.8190   -0.8189

What can cause the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by the normal equation? Do I have a bug in my code?

Here is my implementation of the normal equation in Matlab:

function theta = normalEque(X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];           % prepend intercept column
    theta = pinv(X'*X)*X'*y;       % normal equations via pseudoinverse of X'*X
end

Here is the code for gradient descent:

function theta = gradientDesc(X, y)
    % Minimize the least-squares cost with fminunc, supplying the analytic gradient.
    options = optimset('GradObj', 'on', 'MaxIter', 9999);
    % cost() prepends a column of ones to X, so theta needs one extra element
    % for the intercept term.
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                    zeros(size(X, 2) + 1, 1), options);
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];                          % prepend intercept column
    J = sum((X * theta - y) .^ 2) ./ (2*m);       % half mean squared error
    grad = zeros(size(theta));                    % preallocate gradient
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;
    end
end

I pass exactly the same data X and y to both functions (I do not normalize X).
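
(For reference, a typical z-score normalization of the features before running gradient descent might look like the sketch below; this is not part of the original code, and mu, sigma and X_norm are illustrative names.)

mu = mean(X);                                              % column-wise means
sigma = std(X);                                            % column-wise standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);   % scaled features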

Based on the answers and comments I checked my code and ran some tests.

First I wanted to check whether the problem could be caused by X being near singular, as suggested by @user1489497's answer. So I replaced pinv with inv, and when I ran it I indeed got the warning Matrix is close to singular or badly scaled.. To be sure that this is not the problem, I obtained a much larger dataset and ran the tests with this new dataset. This time inv(X) did not display the warning, and using pinv and inv gave the same results. So I hope that X is no longer close to singular.
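
(One quick way to check this, sketched here rather than taken from the original post, is to look at the rank and condition number of the design matrix directly; Xa is an illustrative name for X with the intercept column.)

Xa = [ones(size(X, 1), 1), X];   % design matrix with intercept column
disp(rank(Xa))                   % full column rank means rank(Xa) == size(Xa, 2)
disp(cond(Xa))                   % large values indicate near-singularity
disp(cond(Xa' * Xa))             % roughly cond(Xa)^2, i.e. what the normal equations see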

Then I changed the normalEque code as suggested by woodchips, so now it looks like this:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];  % prepend intercept column
    theta = pinv(X)*y;             % minimum-norm least-squares solution
end

However, the problem is still there. The new normalEque function on the new data (which is not close to singular) still gives a different theta than gradientDesc.

To find out which algorithm is buggy I ran the linear regression algorithm of the data mining software Weka on the same data. Weka computed a theta very similar to the output of normalEque but different from the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is a comparison of the thetas computed by Weka, normalEque and gradientDesc:

Weka(correct) normalEque    gradientDesc
779.8229      779.8163      302.7994
  1.6571        1.6571        1.7064
  1.8430        1.8431        2.3809
 -1.5945       -1.5945       -1.5964
  3.8190        3.8195        5.7486
 -4.8265       -4.8284      -11.1071
 -6.9000       -6.9006      -11.8924
-15.6956      -15.6958      -13.5411
 43.5561       43.5571       31.5036
-44.5380      -44.5386      -26.5137
  0.9935        0.9926        1.2153
 -3.1556       -3.1576       -1.8517
 -0.1927       -0.1919       -0.6583
  2.9207        2.9227        1.5632
  1.1713        1.1710        1.1622
  0.1091        0.1093        0.0084
  1.5768        1.5762        1.6318
 -1.3968       -1.3958       -2.1131
  0.6966        0.6963        0.5630
  0.1990        0.1990       -0.2521
  0.4624        0.4624        0.2921
-12.6013      -12.6014      -12.2014
 -0.1328       -0.1328       -0.1359

I also computed the errors as suggested by Justin Peel's answer. The output of normalEque gives a slightly smaller squared error, but the difference is marginal. What is more, when I compute the gradient of the cost at the theta returned by normalEque (using the function cost, the same one used by gradientDesc), I get a gradient near zero. Doing the same for the output of gradientDesc does not give a gradient near zero. Here is what I mean:

>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097

This would suggest that gradient descent simply did not converge to the global minimum... But that should hardly be the case, as I ran it for thousands of iterations. So where is the bug?
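
(One way to test the non-convergence hypothesis, sketched here rather than taken from the original post, is to tighten fminunc's stopping tolerances; theta_tight is an illustrative name and the option names are standard optimset options.)

options = optimset('GradObj', 'on', 'MaxIter', 9999, ...
                   'MaxFunEvals', 1e6, 'TolFun', 1e-12, 'TolX', 1e-12);
theta_tight = fminunc(@(t) cost(t, X, y), zeros(size(X, 2) + 1, 1), options);  % uses cost() as defined above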

Answer

I finally had time to get back to this. There is no "bug".

If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set and get an equally good answer. The pinv(X)*y solution is a good one that many like because it is the minimum-norm solution.

There is NEVER a good reason to use inv(X)*y. Even worse is to use the inverse on the normal equations; inv(X'*X)*X'*y is simply numerical crap. I don't care who told you to use it, they are guiding you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don't know when it is about to give you crap. So why use it?)

The normal equations are in general a bad thing to do, EVEN if you are solving a regularized problem. There are ways to solve it that avoid squaring the condition number of the system, although I won't explain them unless asked, as this answer has gotten long enough.
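
(For reference, one such approach is a QR factorization of the design matrix, which works with X directly instead of X'*X; the sketch below illustrates the general technique and is not code from this answer.)

Xa = [ones(size(X, 1), 1), X];   % design matrix with intercept column
[Q, R] = qr(Xa, 0);              % economy-size QR: Xa = Q*R, with R upper triangular
theta = R \ (Q' * y);            % solve R*theta = Q'*y by back substitution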

X\y will also yield a result that is reasonable.

There is ABSOLUTELY no good reason to throw an unconstrained optimizer at the problem, as this will yield results that are unstable, completely dependent on your starting values.

As an example, I'll start with a singular problem.

X = repmat([1 2],5,1);
y = rand(5,1);

>> X\y
Warning: Rank deficient, rank = 1, tol =  2.220446e-15. 
ans =
                         0
         0.258777984694222

>> pinv(X)*y
ans =
         0.103511193877689
         0.207022387755377

pinv and backslash return slightly different solutions. As it turns out, there is a basic solution, to which we can add ANY multiple of the null-space vector of X.

null(X)
ans =
         0.894427190999916
        -0.447213595499958

pinv generates the minimum norm solution. Of all of the solutions that might have resulted, this one has minimum 2-norm.

In contrast, backslash generates a solution that will have one or more variables set to zero.

But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANY multiple of that null vector can be added to your solution, and you still have an entirely valid solution, with the same value of the sum of squares of errors.
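
(As a small sketch illustrating this with the singular example above, not part of the original answer, where alpha is an arbitrary scalar:)

theta0 = pinv(X) * y;              % minimum-norm solution
alpha = 3.7;                       % any scalar works
theta1 = theta0 + alpha * null(X); % shift along the null space of X
disp(norm(X * theta0 - y))         % residual of the minimum-norm solution
disp(norm(X * theta1 - y))         % identical residual, different theta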

Note that even though no singularity warning is returned, this need not mean your matrix is not close to singular. You have changed little about the problem, so it is STILL close, just not enough to trigger the warning.
