How do I determine the coefficients for a linear regression line in MATLAB?


Problem description

I'm going to write a program where the input is a data set of 2D points and the output is the regression coefficients of the line of best fit, obtained by minimizing the mean squared error (MSE).

I have some sample points that I would like to process:

  X      Y
1.00    1.00
2.00    2.00
3.00    1.30
4.00    3.75
5.00    2.25

How would I do this in MATLAB?

Specifically, I need to get the following formula:

y = A + Bx + e

A is the intercept and B is the slope while e is the residual error per point.

Solution

Judging from the link you provided and my understanding of your problem, you want to calculate the line of best fit for a set of data points, and you want to do this from first principles. This requires some basic calculus as well as some linear algebra for solving a 2 x 2 system of equations. Recall from linear regression theory that, given n data points ([x_1,y_1], [x_2,y_2], ..., [x_n,y_n]), we wish to find the slope m and intercept b that minimize the sum of squared residuals between the line and the data points.

In other words, we wish to minimize the cost function F(m,b,x,y):
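
Assuming the usual sum-of-squared-residuals definition described above, the cost function can be written as:

```latex
F(m, b, x, y) = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
```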

m and b are the slope and intercept of this best fit line, while x and y are the vectors of x and y coordinates that form our data set.

This function is convex, so it has a global minimum that we can determine. We find it by taking the derivative with respect to each parameter and setting each derivative equal to 0, then solving for m and b. The intuition behind this is that we are finding the m and b that jointly minimize the cost function. In other words:
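
In symbols, the two stationarity conditions just described are:

```latex
\frac{\partial F}{\partial m} = 0, \qquad \frac{\partial F}{\partial b} = 0
```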

OK, so let's find the first quantity, ∂F/∂m:
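
Differentiating with respect to m by the chain rule, and assuming the cost function F is the sum of (y_i - m*x_i - b)^2 over all points, this should read:

```latex
\frac{\partial F}{\partial m} = \sum_{i=1}^{n} 2 \left( y_i - m x_i - b \right) \left( -x_i \right) = 0
```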

We can drop the factor 2 from the derivative as the other side of the equation is equal to 0, and we can also do some distribution of terms by multiplying the -x_i term throughout:
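
Carrying out that distribution term by term (a sketch, under the same assumed cost function of summed squared residuals) gives:

```latex
\sum_{i=1}^{n} \left( -x_i y_i + m x_i^2 + b x_i \right) = 0
\quad \Longrightarrow \quad
-\sum_{i=1}^{n} x_i y_i + m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = 0
```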

Next, let's tackle the derivative with respect to the next parameter, ∂F/∂b:
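
The same chain-rule step applied to b, again assuming F is the sum of (y_i - m*x_i - b)^2, would be:

```latex
\frac{\partial F}{\partial b} = \sum_{i=1}^{n} 2 \left( y_i - m x_i - b \right) \left( -1 \right) = 0
```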

We can again drop the factor of 2 and distribute the -1 throughout the expression:
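
Written out, with the sum split across its three terms, this step looks like:

```latex
\sum_{i=1}^{n} \left( -y_i + m x_i + b \right) = 0
\quad \Longrightarrow \quad
-\sum_{i=1}^{n} y_i + m \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} 1 = 0
```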

Knowing that \sum_{i=1}^{n} 1 is simply n, we can simplify the above to:
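
Since summing the constant 1 over all n points gives n, the condition for b reduces to:

```latex
-\sum_{i=1}^{n} y_i + m \sum_{i=1}^{n} x_i + n b = 0
```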

Now, we need to simultaneously solve for m and b with the above two equations. This will jointly minimize the cost function which finds the best line of fit for our data points.

Doing some re-arranging, we can isolate m and b on one side of the equations and the rest on the other sides:
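
Moving the terms that do not involve m or b to the right-hand side, the two conditions (often called the normal equations) become:

```latex
\begin{aligned}
m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \\
m \sum_{i=1}^{n} x_i + n b &= \sum_{i=1}^{n} y_i
\end{aligned}
```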

As you can see, we can formulate this into a 2 x 2 system of equations to solve for m and b. Specifically, let's re-arrange the two equations above so that it's in matrix form:
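
Stacking the unknowns m and b into a vector, a matrix form consistent with the two equations for m and b is:

```latex
\begin{bmatrix}
\sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n} x_i & n
\end{bmatrix}
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{n} x_i y_i \\
\sum_{i=1}^{n} y_i
\end{bmatrix}
```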


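With regard to the above, we can decompose the problem by solving a linear system: Ax = b. Here x and b denote the generic unknown vector and right-hand side of the system, not the data vector or the intercept. All you have to do is solve for x, which is x = A^{-1}*b. To find the inverse of a 2 x 2 system, given the matrix: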
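
A generic 2 x 2 matrix can be written as follows. The entry names p, q, r and s are placeholders chosen here to avoid clashing with the intercept b; they are an assumption, not the original notation:

```latex
A = \begin{bmatrix} p & q \\ r & s \end{bmatrix}
```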

The inverse is simply:
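
This is the standard closed form for a 2 x 2 inverse, shown with placeholder entries p, q, r, s (assumed names) and valid only when the determinant ps - qr is nonzero:

```latex
\begin{bmatrix} p & q \\ r & s \end{bmatrix}^{-1}
= \frac{1}{ps - qr}
\begin{bmatrix} s & -q \\ -r & p \end{bmatrix}
```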

Therefore, by substituting our quantities into the above equation, we solve for m and b in matrix form, and it simplifies to this:
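
Applying the 2 x 2 inverse formula to the coefficient matrix of the normal equations (entries sum of x_i^2, sum of x_i, sum of x_i, and n) gives the following sketch:

```latex
\begin{bmatrix} m \\ b \end{bmatrix}
= \frac{1}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}
\begin{bmatrix}
n & -\sum_{i=1}^{n} x_i \\
-\sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2
\end{bmatrix}
\begin{bmatrix}
\sum_{i=1}^{n} x_i y_i \\
\sum_{i=1}^{n} y_i
\end{bmatrix}
```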

Carrying out this multiplication and solving for m and b individually, this gives:
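
Multiplying out row by row yields the closed-form expressions below. Negating both the numerator and the denominator gives the equivalent form with (sum of x_i)^2 - n * (sum of x_i^2) in the denominator:

```latex
m = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2},
\qquad
b = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}
```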

As such, to find the best slope and intercept to best fit your data, you need to calculate m and b using the above equations.

Given your data specified in the link in your comments, we can do this quite easily:

%// Define points
X = 1:5;
Y = [1 2 1.3 3.75 2.25];

%// Get total number of points
n = numel(X);

%// Compute the sums needed by the closed-form solution
sumxi = sum(X);
sumyi = sum(Y);
sumxiyi = sum(X.*Y);
sumxi2 = sum(X.^2);

%// Determine slope and intercept
m = (sumxi * sumyi - n*sumxiyi) / (sumxi^2 - n*sumxi2);
b = (sumxiyi * sumxi - sumyi * sumxi2) / (sumxi^2 - n*sumxi2);

%// Display them
disp([m b])

... and we get:

0.4250  0.7850

Therefore, the line of best fit that minimizes the error is:

y = 0.4250*x + 0.7850
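
As a cross-check, the same slope and intercept can be obtained by building the 2 x 2 system directly and letting MATLAB solve it. This is only a sketch: the variable names A, rhs and params are chosen here for illustration, and the backslash operator is used in place of an explicit inverse.

```matlab
% Sample data from the question
X = 1:5;
Y = [1 2 1.3 3.75 2.25];
n = numel(X);

% Normal equations in matrix form: A * [m; b] = rhs
A   = [sum(X.^2), sum(X); sum(X), n];
rhs = [sum(X.*Y); sum(Y)];

% Backslash solves the linear system without forming inv(A) explicitly
params = A \ rhs;
m = params(1);   % slope
b = params(2);   % intercept

fprintf('m = %.4f, b = %.4f\n', m, b);
```

Solving with backslash rather than inv(A)*rhs is generally the recommended habit in MATLAB, although for a well-conditioned 2 x 2 system either route gives the same answer.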


However, if you want to use built-in MATLAB tools, you can use polyfit (credit goes to Luis Mendo for providing the hint). polyfit determines the line (or, more generally, the nth order polynomial curve) of best fit by minimizing the sum of squared errors between the fitted curve and your data points. You call the function like so:

coeff = polyfit(x,y,order);

x and y are the x and y points of your data while order determines the order of the line of best fit you want. As an example, order=1 means that the line is linear, order=2 means that the line is quadratic and so on. Essentially, polyfit fits a polynomial of order order given your data points. Given your problem, order=1. As such, given the data in the link, you would simply do:

X = 1:5;
Y = [1 2 1.3 3.75 2.25];
coeff = polyfit(X,Y,1)

coeff =

    0.4250    0.7850

coeff holds the coefficients of the regression line, ordered from the highest power of x down to the lowest. As such, the above coeff variable means that the regression line was fitted as:

y = 0.4250*x + 0.7850

The first coefficient is the slope while the second coefficient is the intercept. You'll also see that this matches up with the link you provided.

If you want a visual representation, here's a plot of the data points as well as the regression line that best fits these points:

plot(X, Y, 'r.', X, polyval(coeff, X));

[Plot: the data points shown as red dots together with the fitted regression line]

polyval takes an array of coefficients (usually produced by polyfit) and a set of x coordinates, and computes the corresponding y values. Essentially, you are evaluating the points along the best fit line.
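
Since the question also asks for the residual e at each point, one possible way to get it is to subtract the polyval output from the observed values. The names Yfit, e and mse below are illustrative choices, not part of the original answer.

```matlab
% Sample data from the question
X = 1:5;
Y = [1 2 1.3 3.75 2.25];

% Fit a first-order polynomial: coeff = [slope, intercept]
coeff = polyfit(X, Y, 1);

% Evaluate the fitted line at the sample x coordinates
Yfit = polyval(coeff, X);

% Residual per point (observed minus fitted) and the mean squared error
e   = Y - Yfit;
mse = mean(e.^2);

disp(e);
disp(mse);
```

For a least-squares line that includes an intercept, the residuals sum to zero up to floating-point error, which makes a quick sanity check.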


Edit - Extending to higher orders

If you want to extend this so that you're finding the best fit for a polynomial of any order m, I won't go into the details, but it boils down to constructing the following linear system. Given the relationship for the ith point between (x_i, y_i):
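
For an order m polynomial, the per-point relationship can be sketched as below. The coefficient names beta_0 through beta_m are an assumption made here, since the original symbols are not shown:

```latex
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_m x_i^m + e_i
```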

You would construct the following linear system:
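
Writing one such equation per data point gives a system of n equations in the m + 1 unknown coefficients (named beta_0 through beta_m here as an assumption):

```latex
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \dots + \beta_m x_1^m + e_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \beta_2 x_2^2 + \dots + \beta_m x_2^m + e_2 \\
    &\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_n + \beta_2 x_n^2 + \dots + \beta_m x_n^m + e_n
\end{aligned}
```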

Basically, you would create a vector of points y, and you would construct a matrix X such that each column is your vector of points x raised to a particular power. Specifically, the first column is the zeroth power, the second column is the first power, the third column is the second power and so on. You would do this up to m, which is the order of the polynomial you want. The vector e holds the residual error for each point in your set.

Specifically, the formulation of the problem can be written in matrix form as:
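
With the columns of X holding increasing powers of x, and using beta as an assumed name for the coefficient vector, the matrix form would be:

```latex
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{y}
=
\underbrace{\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^m
\end{bmatrix}}_{X}
\underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{bmatrix}}_{\beta}
+
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}}_{e}
```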

Once you construct this matrix, you would find the parameters by least squares by calculating the pseudo-inverse. You can read up on how the pseudo-inverse is derived in the Wikipedia article I linked to; it is the backbone of least-squares minimization. Specifically:
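
Using beta as an assumed name for the coefficient vector, and assuming X has full column rank so that the inverse exists, the least-squares solution is:

```latex
\beta = \left( X^{T} X \right)^{-1} X^{T} y
```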

(X^{T}*X)^{-1}*X^{T} is the pseudo-inverse. X itself is a very popular matrix known as the Vandermonde matrix, and MATLAB has a command called vander to help you compute it. A small note is that vander returns its columns in reverse order: for a vector of n points it produces a square n x n matrix whose powers decrease from n-1 down to 0. For an order m fit you only need the powers 0 through m, which are the last m+1 columns, and you'd call fliplr on them if you want the powers in increasing order.

I won't go into how you'd repeat your example for anything of higher order than linear. I'm going to leave that to you as a learning exercise: construct the vector y and the matrix X with vander, then apply the pseudo-inverse of X as above to solve for your parameters.
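
For readers who want a starting point, here is a minimal sketch of that procedure for a quadratic (m = 2) on the same sample data. The names V, Xmat and beta are illustrative, and the normal equations are solved with backslash instead of forming the inverse explicitly.

```matlab
% Sample data from the question, stored as column vectors
x = (1:5).';
y = [1 2 1.3 3.75 2.25].';
m = 2;                              % polynomial order (assumed for this sketch)

% vander(x) is n-by-n with powers n-1 down to 0; keep the last m+1 columns
% (powers m down to 0) and flip them so the powers run from 0 up to m
V    = vander(x);
Xmat = fliplr(V(:, end-m:end));     % columns: x.^0, x.^1, ..., x.^m

% Least-squares solution of the normal equations (X'X) * beta = X'y
beta = (Xmat.' * Xmat) \ (Xmat.' * y);

% polyfit lists coefficients from the highest power down, so flip to compare
coeff = polyfit(x, y, m);
disp(max(abs(beta - fliplr(coeff).')));
```

The displayed difference should be close to zero. In practice beta = Xmat \ y is usually preferred, because forming X'X squares the condition number of the problem.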


Good luck!
