linear regression using lm() - surprised by the result


I ran a linear regression on my data using the lm function. Everything works (no error message), but I'm somewhat surprised by the result: I get the impression that R "misses" a group of points, i.e. the intercept and slope are not the best fit. For instance, I am referring to the group of points at coordinates x = 15-25, y = 0-20.

My questions:

  • is there a function to compare the fit of "expected" coefficients against the "lm-calculated" ones?
  • have I made a silly mistake in my code that leads lm to do that?

Following some answers: additional information on x and y

x and y are both visual estimates of disease symptoms. Both carry the same uncertainty.

The data and code are here:

x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
     21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
     20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
     18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
     17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
     29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
     27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
     34.2,34.0,32.6,30.8)

y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
     45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
     10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
     3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
     1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
     47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
     86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
     84.0,93.5,83.5,96.5,96.0,97.5)   



## x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))

# linear regression
reg_lin=lm(y1 ~ x1)
abline(reg_lin,lty="solid", col="royalblue")
text(12.5,25,labels="R result",col="royalblue", cex=0.85)
text(12.5,20,labels=bquote(y== .(5.26)*x - .(76)),col="royalblue", cex=0.85)

# result I would have imagined
abline(a=-150,b=8,lty="dashed", col="red")
text(27.5,25,labels="What I think is better",col="red", cex=0.85)
text(27.5,20,labels=bquote(y== .(8)*x - .(150)),col="red", cex=0.85)

Solution

Try this:

reg_lin_int <- reg_lin$coefficients[1]
reg_lin_slp <- reg_lin$coefficients[2]

sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18

The sum of squared residuals is lower under the lm fit line. This is to be expected, as reg_lin_int and reg_lin_slp are guaranteed to produce the minimal total squared error.
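As for your first question: as far as I know there is no single built-in function that compares an lm() fit against an arbitrary set of "expected" coefficients, but the comparison above is easy to wrap in a small helper. This is only an illustrative sketch; the function name and interface are made up here:

# Hypothetical helper: total squared error of a candidate line y = a + b*x
# evaluated on the observed data
sse_line <- function(a, b, x = x1, y = y1) {
  sum((y - (a + b * x)) ^ 2)
}

sse_line(reg_lin_int, reg_lin_slp)  # the lm() fit
sse_line(-150, 8)                   # the hand-drawn guess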

Intuitively, we know that estimators under a squared loss function are sensitive to outliers. The fit "misses" the group at the bottom because it instead moves closer to the group at the top left, which is much further away, and squared distances give those far points more weight.
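To make that intuition concrete, you can repeat the comparison above with an absolute loss instead of a squared one and see how the two lines compare under that criterion (numbers omitted here; this is just a sketch to run yourself):

# Total absolute deviation under the lm() line and under the guessed line
sum(abs(y1 - (reg_lin_int + reg_lin_slp * x1)))
sum(abs(y1 - (-150 + 8 * x1)))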

In fact, if we use Least Absolute Deviations regression (i.e., specify an absolute loss function instead of a squared one), the result is much closer to your guess:

library(quantreg)
lad_reg <- rq(y1 ~ x1)

(Pro tip: use lwd to make your graphs much more readable)
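For example, assuming the scatter plot from the question is still the active device, the LAD line can be added to it; coef() works on rq objects, so its intercept and slope can be passed to abline(). The colour and line type here are arbitrary choices:

# Overlay the LAD fit on the existing plot; lwd = 2 makes it easier to see
abline(coef(lad_reg)[1], coef(lad_reg)[2],
       lty = "dotdash", col = "darkgreen", lwd = 2)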

What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:

# PCA of the two variables: the columns of $rotation are the principal directions
v <- prcomp(cbind(x1, y1))$rotation
# slope of the orthogonal (TLS) fit, taken from the PCA loadings
bbeta <- v[-ncol(v), ncol(v)] / v[1, 1]
# the TLS line passes through the centroid of the data, which gives the intercept
inter <- mean(y1) - bbeta * mean(x1)
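To see how this orthogonal fit compares with the others, you can draw it on the same plot (again assuming the plot from the question is still open; colour and line type are arbitrary):

# Overlay the TLS line: slope bbeta, intercept inter
abline(a = inter, b = bbeta, lty = "dotted", col = "purple", lwd = 2)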
