Forecast.lm()如何计算置信区间和预测区间? [英] How does predict.lm() compute confidence interval and prediction interval?
问题描述
我进行了回归:
CopierDataRegression <- lm(V1~V2, data=CopierData1)
我的任务是获得一个
-
给定
- 90%置信区间
>时 - 90%预测间隔.
V2=6
和的平均响应- 90% confidence interval for the mean response given
V2=6
and - 90% prediction interval when
V2=6
.
我使用了以下代码:
X6 <- data.frame(V2=6)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90)
我得到了(87.3, 91.9)
和(74.5, 104.8)
这似乎是正确的,因为PI应该更宽.
and I got (87.3, 91.9)
and (74.5, 104.8)
which seems to be correct since the PI should be wider.
两者的输出也包括相同的se.fit = 1.39
. 我不知道此标准错误是什么. PI与CI相比,标准误差是否应该更大?如何在R中找到这两个不同的标准错误?
The output for both also included se.fit = 1.39
which was the same. I don't understand what this standard error is. Shouldn't the standard error be larger for the PI vs. the CI? How do I find these two different standard errors in R?
数据:
CopierData1 <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L,
4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L,
66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L,
90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L,
61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L,
10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L,
2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L,
2L, 4L, 5L)), .Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -45L))
推荐答案
指定interval
和level
自变量时,predict.lm
可以返回置信区间(CI)或预测区间(PI).此答案显示了如何在不设置这些参数的情况下获取CI和PI.有两种方法:
When specifying interval
and level
argument, predict.lm
can return confidence interval (CI) or prediction interval (PI). This answer shows how to obtain CI and PI without setting these arguments. There are two ways:
- 使用来自
predict.lm
的中间结果; - 从头开始做所有事情.
- use middle-stage result from
predict.lm
; - do everything from scratch.
了解如何同时使用两种方法,可以使您对预测过程有透彻的了解.
Knowing how to work with both ways give you a thorough understand of the prediction procedure.
请注意,我们仅介绍predict.lm
的type = "response"
(默认)情况. type = "terms"
的讨论超出了此答案的范围.
Note that we will only cover the type = "response"
(default) case for predict.lm
. Discussion of type = "terms"
is beyond the scope of this answer.
我在这里收集您的代码,以帮助其他读者复制,粘贴和运行.我还更改了变量名称,以使它们具有更清晰的含义.另外,我将newdat
扩展为包括多行,以表明我们的计算是向量化的".
I gather your code here to help other readers to copy, paste and run. I also change variable names so that they have clearer meanings. In addition, I expand the newdat
to include more than one rows, to show that our computations are "vectorized".
dat <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L,
4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L,
66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L,
90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L,
61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L,
10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L,
2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L,
2L, 4L, 5L)), .Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -45L))
lmObject <- lm(V1 ~ V2, data = dat)
newdat <- data.frame(V2 = c(6, 7))
以下是predict.lm
的输出,稍后将与我们的手动计算进行比较.
The following are the output of predict.lm
, to be compared with our manual computations later.
predict(lmObject, newdat, se.fit = TRUE, interval = "confidence", level = 0.90)
#$fit
# fit lwr upr
#1 89.63133 87.28387 91.9788
#2 104.66658 101.95686 107.3763
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
predict(lmObject, newdat, se.fit = TRUE, interval = "prediction", level = 0.90)
#$fit
# fit lwr upr
#1 89.63133 74.46433 104.7983
#2 104.66658 89.43930 119.8939
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
使用来自predict.lm
的中间结果
Use middle-stage result from predict.lm
## use `se.fit = TRUE`
z <- predict(lmObject, newdat, se.fit = TRUE)
#$fit
# 1 2
# 89.63133 104.66658
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
什么是
se.fit
?
z$se.fit
是预测平均值z$fit
的标准误差,用于构建z$fit
的CI.我们还需要具有自由度z$df
的t分布的分位数.
z$se.fit
is the standard error of the predicted mean z$fit
, used to construct CI for z$fit
. We also need quantiles of t-distribution with a degree of freedom z$df
.
alpha <- 0.90 ## 90%
Qt <- c(-1, 1) * qt((1 - alpha) / 2, z$df, lower.tail = FALSE)
#[1] -1.681071 1.681071
## 90% confidence interval
CI <- z$fit + outer(z$se.fit, Qt)
colnames(CI) <- c("lwr", "upr")
CI
# lwr upr
#1 87.28387 91.9788
#2 101.95686 107.3763
我们看到这与predict.lm(, interval = "confidence")
一致.
PI的标准错误是什么?
What is the standard error for PI?
PI比CI宽,因为它说明了残差:
PI is wider than CI, as it accounts for residual variance:
variance_of_PI = variance_of_CI + variance_of_residual
请注意,这是逐点定义的.对于非加权线性回归(如您的示例),残余方差在任何地方都是相等的(称为 homoscedasticity ),并且为z$residual.scale ^ 2
.因此,PI的标准误差为
Note that this is defined point-wise. For a non-weighted linear regression (as in your example), residual variance is equal everywhere (known as homoscedasticity), and it is z$residual.scale ^ 2
. Thus the standard error for PI is
se.PI <- sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2)
# 1 2
#9.022228 9.058082
并且PI的构造为
PI <- z$fit + outer(se.PI, Qt)
colnames(PI) <- c("lwr", "upr")
PI
# lwr upr
#1 74.46433 104.7983
#2 89.43930 119.8939
我们看到这与predict.lm(, interval = "prediction")
一致.
备注
如果您具有权重线性回归,则情况会更加复杂,其中残差在各处均不相等,因此应加权z$residual.scale ^ 2
.为拟合值构造PI更为容易(也就是说,在predict.lm
中使用type = "prediction"
时不设置newdata
),因为权重是已知的(使用时必须通过weight
参数提供它) lm
).对于样本外预测(即,将newdata
传递给predict.lm
),predict.lm
希望您告诉它应该如何加权残差方差.您需要在predict.lm
中使用参数pred.var
或weights
,否则您会收到predict.lm
的警告,抱怨缺少用于构造PI的信息. ?predict.lm
中引用了以下内容:
Things are more complicated if you have a weight linear regression, where the residual variance is not equal everywhere so that z$residual.scale ^ 2
should be weighted. It is easier to construct PI for fitted values (that is, you don't set newdata
when using type = "prediction"
in predict.lm
), because the weights are known (you must have provided it via weight
argument when using lm
). For out-of-sample prediction (that is, you pass a newdata
to predict.lm
), predict.lm
expects you to tell it how residual variance should be weighted. You need either use argument pred.var
or weights
in predict.lm
, otherwise you get a warning from predict.lm
complaining insufficient information for constructing PI. The following are quoted from ?predict.lm
:
The prediction intervals are for a single observation at each case
in ‘newdata’ (or by default, the data used for the fit) with error
variance(s) ‘pred.var’. This can be a multiple of ‘res.var’, the
estimated value of sigma^2: the default is to assume that future
observations have the same error variance as those used for
fitting. If ‘weights’ is supplied, the inverse of this is used as
a scale factor. For a weighted fit, if the prediction is for the
original data frame, ‘weights’ defaults to the weights used for
the model fit, with a warning since it might not be the intended
result. If the fit was weighted and ‘newdata’ is given, the
default is to assume constant prediction variance, with a warning.
请注意,CI的构造不受回归类型的影响.
Note that construction of CI is not affected by the type of regression.
基本上,我们想知道如何在z
中获得fit
,se.fit
,df
和residual.scale
.
Basically we want to know how to obtain fit
, se.fit
, df
and residual.scale
in z
.
预测平均值可以通过矩阵矢量乘积Xp %*% b
进行计算,其中Xp
是线性预测矩阵,b
是回归系数矢量.
The predicted mean can be computed by a matrix-vector multiplication Xp %*% b
, where Xp
is the linear predictor matrix and b
is regression coefficient vector.
Xp <- model.matrix(delete.response(terms(lmObject)), newdat)
b <- coef(lmObject)
yh <- c(Xp %*% b) ## c() reshape the single-column matrix to a vector
#[1] 89.63133 104.66658
我们看到这与z$fit
一致. yh
的方差-协方差是Xp %*% V %*% t(Xp)
,其中V
是b
的方差-协方差矩阵,可以通过以下公式计算
And we see that this agrees with z$fit
. The variance-covariance for yh
is Xp %*% V %*% t(Xp)
, where V
is the variance-covariance matrix of b
which can be computed by
V <- vcov(lmObject) ## use `vcov` function in R
# (Intercept) V2
# (Intercept) 7.862086 -1.1927966
# V2 -1.192797 0.2333733
yh
的完整方差-协方差矩阵无需计算逐点CI或PI.我们只需要它的主要对角线.因此,除了执行diag(Xp %*% V %*% t(Xp))
之外,我们还可以通过以下方式更有效地进行
The full variance-covariance matrix of yh
is not needed to compute point-wise CI or PI. We only need its main diagonal. So instead of doing diag(Xp %*% V %*% t(Xp))
, we can do it more efficiently via
var.fit <- rowSums((Xp %*% V) * Xp) ## point-wise variance for predicted mean
# 1 2
#1.949963 2.598222
sqrt(var.fit) ## this agrees with `z$se.fit`
# 1 2
#1.396411 1.611900
剩余的自由度在拟合模型中很容易获得:
The residual degree of freedom is readily available in the fitted model:
dof <- df.residual(lmObject)
#[1] 43
最后,要计算残差,请使用Pearson估计器:
Finally, to compute residual variance, use Pearson estimator:
sig2 <- c(crossprod(lmObject$residuals)) / dof
# [1] 79.45063
sqrt(sig2) ## this agrees with `z$residual.scale`
#[1] 8.913508
备注
请注意,在进行加权回归的情况下,sig2
的计算应为
Note that in case of weighted regression, sig2
should be computed as
sig2 <- c(crossprod(sqrt(lmObject$weights) * lmObject$residuals)) / dof
附录:模仿predict.lm
的自写函数
在本问答中,从头开始做所有事情"中的代码已被清晰地组织成函数lm_predict
.答:带有lm
的线性模型:如何获取预测值总和的预测方差.
Appendix: a self-written function that mimics predict.lm
The code in "Do everything from scratch" has been cleanly organized into a function lm_predict
in this Q & A: linear model with lm
: how to get prediction variance of sum of predicted values.
这篇关于Forecast.lm()如何计算置信区间和预测区间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!