How do I determine the coefficients for a linear regression line in MATLAB?


Problem description

I'm going to write a program where the input is a data set of 2D points and the output is the regression coefficients of the line of best fit, found by minimizing the mean squared error (MSE).

I have some sample points that I would like to process:

  X      Y
1.00    1.00
2.00    2.00
3.00    1.30
4.00    3.75
5.00    2.25

How would I do this in MATLAB?

Specifically, I need to get the following formula:

y = A + Bx + e

A is the intercept and B is the slope while e is the residual error per point.

Answer

Judging from the link you provided, and my understanding of your problem, you want to calculate the line of best fit for a set of data points. You also want to do this from first principles. This will require some basic calculus as well as some linear algebra for solving a 2 x 2 system of equations. If you recall from linear regression theory, we wish to find the best slope m and intercept b such that for a set of points ([x_1,y_1], [x_2,y_2], ..., [x_n,y_n]) (that is, we have n data points), we want to minimize the sum of squared residuals between this line and the data points.

In other words, we wish to minimize the cost function F(m,b,x,y):

F(m,b,x,y) = sum_{i=1}^{n} (y_i - (m*x_i + b))^2

m and b are our slope and intercept for this best fit line, while x and y are vectors of the x and y co-ordinates that form our data set.

This function is convex, so there is an optimal minimum that we can determine. The minimum can be determined by finding the derivative with respect to each parameter, and setting these equal to 0. We then solve for m and b. The intuition behind this is that we are simultaneously finding m and b such that the cost function is jointly minimized by these two parameters. In other words:

dF/dm = 0
dF/db = 0

OK, so let's find the first quantity, dF/dm:

dF/dm = sum_{i=1}^{n} 2*(y_i - (m*x_i + b))*(-x_i) = 0

We can drop the factor 2 from the derivative as the other side of the equation is equal to 0, and we can also do some distribution of terms by multiplying the -x_i term throughout:

sum_{i=1}^{n} (m*x_i^2 + b*x_i - x_i*y_i) = 0

Next, let's tackle the next parameter, dF/db:

dF/db = sum_{i=1}^{n} 2*(y_i - (m*x_i + b))*(-1) = 0

We can again drop the factor of 2 and distribute the -1 throughout the expression:

sum_{i=1}^{n} (m*x_i + b - y_i) = 0

Knowing that sum_{i=1}^{n} 1 is simply n, we can simplify the above to (writing sum(x_i) as shorthand for sum_{i=1}^{n} x_i from here on):

m*sum(x_i) + n*b - sum(y_i) = 0

Now, we need to simultaneously solve for m and b with the above two equations. This will jointly minimize the cost function which finds the best line of fit for our data points.

Doing some re-arranging, we can isolate m and b on one side of the equations and the rest on the other sides:

m*sum(x_i^2) + b*sum(x_i) = sum(x_i*y_i)
m*sum(x_i)   + b*n        = sum(y_i)

As you can see, we can formulate this into a 2 x 2 system of equations to solve for m and b. Specifically, let's re-arrange the two equations above so that they're in matrix form:

[ sum(x_i^2)  sum(x_i) ] [ m ]   [ sum(x_i*y_i) ]
[ sum(x_i)    n        ] [ b ] = [ sum(y_i)     ]

With regards to the above, we can decompose the problem by solving a linear system: Ax = b. All you have to do is solve for x, which is x = A^{-1}*b. To find the inverse of a 2 x 2 system, given the matrix (where a, b, c, d here are generic entries, not our slope and intercept):

A = [ a  b ]
    [ c  d ]

the inverse is simply:

A^{-1} = (1/(a*d - b*c)) * [  d  -b ]
                           [ -c   a ]

Therefore, by substituting our quantities into the above equation, we solve for m and b in matrix form, and it simplifies to this:

[ m ]                                       [  n          -sum(x_i)   ] [ sum(x_i*y_i) ]
[ b ] = 1/(n*sum(x_i^2) - sum(x_i)^2)   *   [ -sum(x_i)    sum(x_i^2) ] [ sum(y_i)     ]

Carrying out this multiplication and solving for m and b individually, this gives:

m = (sum(x_i)*sum(y_i) - n*sum(x_i*y_i)) / (sum(x_i)^2 - n*sum(x_i^2))
b = (sum(x_i*y_i)*sum(x_i) - sum(y_i)*sum(x_i^2)) / (sum(x_i)^2 - n*sum(x_i^2))

As such, to find the best slope and intercept to best fit your data, you need to calculate m and b using the above equations.

Given your data specified in the link in your comments, we can do this quite easily:

%// Define points
X = 1:5;
Y = [1 2 1.3 3.75 2.25];

%// Get total number of points
n = numel(X);

%// Define relevant quantities
sumxi = sum(X);
sumyi = sum(Y);
sumxiyi = sum(X.*Y);
sumxi2 = sum(X.^2);

%// Determine slope and intercept
m = (sumxi * sumyi - n*sumxiyi) / (sumxi^2 - n*sumxi2);
b = (sumxiyi * sumxi - sumyi * sumxi2) / (sumxi^2 - n*sumxi2);

%// Display them
disp([m b])

... and we get:

0.4250  0.7850

Therefore, the line of best fit that minimizes the error is:

y = 0.4250*x + 0.7850
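The question also asked for the residual error e per point. As a quick aside (not part of the original MATLAB answer), here is a sketch in Python/NumPy that applies the closed-form formulas above and computes those residuals:

```python
import numpy as np

# Sample points from the question
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

n = X.size
sumxi = X.sum()
sumyi = Y.sum()
sumxiyi = (X * Y).sum()
sumxi2 = (X ** 2).sum()

# Closed-form slope and intercept from the derivation above
m = (sumxi * sumyi - n * sumxiyi) / (sumxi ** 2 - n * sumxi2)
b = (sumxiyi * sumxi - sumyi * sumxi2) / (sumxi ** 2 - n * sumxi2)

# Residual error per point: e_i = y_i - (m*x_i + b)
e = Y - (m * X + b)

print(round(m, 4), round(b, 4))  # 0.425 0.785
```

A handy sanity check: for a least-squares line the residuals e always sum to (numerically) zero.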


However, if you want to use built-in MATLAB tools, you can use polyfit (credit goes to Luis Mendo for providing the hint). polyfit determines the line (or nth order polynomial curve, rather...) of best fit by linear regression, minimizing the sum of squared errors between the best fit line and your data points. You call the function like so:

coeff = polyfit(x,y,order);

x and y are the x and y points of your data while order determines the order of the line of best fit you want. As an example, order=1 means that the line is linear, order=2 means that the line is quadratic and so on. Essentially, polyfit fits a polynomial of order order given your data points. Given your problem, order=1. As such, given the data in the link, you would simply do:

X = 1:5;
Y = [1 2 1.3 3.75 2.25];
coeff = polyfit(X,Y,1)

coeff =

    0.4250    0.7850

The way coeff works is that these are the coefficients of the regression line, starting from the highest order in decreasing value. As such, the above coeff variable means that the regression line was fitted as:

y = 0.4250*x + 0.7850

The first coefficient is the slope while the second coefficient is the intercept. You'll also see that this matches up with the link you provided.

If you want a visual representation, here's a plot of the data points as well as the regression line that best fits these points:

plot(X, Y, 'r.', X, polyval(coeff, X));

Here's the plot: the red dots are the data points, with the regression line of best fit drawn through them.

polyval takes an array of coefficients (usually produced by polyfit), and you provide a set of x co-ordinates and it calculates what the y values are given the values of x. Essentially, you are evaluating what the points are along the best fit line.
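As an aside (not from the original answer), NumPy offers the same pair of functions with the same highest-order-first coefficient convention, so the polyfit/polyval workflow above can be sketched in Python like this:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

# Degree-1 (linear) fit; coefficients come back highest order first,
# just like MATLAB's polyfit
coeff = np.polyfit(X, Y, 1)

# Evaluate the fitted line at the original x values
fitted = np.polyval(coeff, X)

print(np.round(coeff, 4))  # [0.425 0.785]
```

Plotting X against fitted then mirrors the MATLAB plot call above.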

If you want to extend this so that you're finding the best fit for any nth order polynomial, I won't go into the details, but it boils down to constructing the following linear system. Given the relationship for the ith point between (x_i, y_i):

y_i = c_0 + c_1*x_i + c_2*x_i^2 + ... + c_m*x_i^m + e_i

You would construct the following linear system:

[ y_1 ]   [ 1  x_1  x_1^2  ...  x_1^m ] [ c_0 ]   [ e_1 ]
[ y_2 ]   [ 1  x_2  x_2^2  ...  x_2^m ] [ c_1 ]   [ e_2 ]
[ ... ] = [ ...                       ] [ ... ] + [ ... ]
[ y_n ]   [ 1  x_n  x_n^2  ...  x_n^m ] [ c_m ]   [ e_n ]

Basically, you would create a vector of points y, and you would construct a matrix X such that each column denotes taking your vector of points x and applying a power operation to it. Specifically, the first column is the zero-th power, the second column is the first power, the third column is the second power, and so on. You would do this up until m, which is the order of polynomial you want. The vector e would be the residual error for each point in your set.

Specifically, the formulation of the problem can be written in matrix form as:

y = X*c + e

Once you construct this matrix, you would find the parameters by least-squares by calculating the pseudo-inverse. How the pseudo-inverse is derived, you can read up on the Wikipedia article I linked to, but this is the basis for minimizing a system by least-squares. The pseudo-inverse is the backbone behind least-squares minimization. Specifically:

c = (X^{T}*X)^{-1}*X^{T}*y

(X^{T}*X)^{-1}*X^{T} is the pseudo-inverse. X itself is a very popular matrix, which is known as the Vandermonde matrix and MATLAB has a command called vander to help you compute that matrix. A small note is that vander in MATLAB is returned in reverse order. The powers decrease from m-1 down to 0. If you want to have this reversed, you'd need to call fliplr on that output matrix. Also, you will need to append one more column at the end of it, which is the vector with all of its elements raised to the mth power.

I won't go into how you'd repeat your example for anything higher order than linear. I'm going to leave that to you as a learning exercise, but simply construct the vector y, the matrix X with vander, then find the parameters by applying the pseudo-inverse of X with the above to solve for your parameters.
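To make that recipe concrete, here is a hedged sketch (in Python/NumPy rather than MATLAB, and using numpy.linalg.lstsq in place of an explicit pseudo-inverse, which is the numerically safer way to solve the same least-squares system). np.vander with increasing=True plays the role of MATLAB's vander followed by the fliplr step described above:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

order = 2  # fit a quadratic as an example

# Vandermonde matrix with columns x^0, x^1, ..., x^order
V = np.vander(X, order + 1, increasing=True)

# Least-squares solution of y = V*c + e; equivalent to applying
# the pseudo-inverse (V^T V)^{-1} V^T to y
c, *_ = np.linalg.lstsq(V, Y, rcond=None)
```

Reversing c should reproduce np.polyfit(X, Y, 2), which uses the highest-order-first convention.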

Good luck!
