How does predict.lm() compute confidence interval and prediction interval?


Question

I ran the regression:

CopierDataRegression <- lm(V1~V2, data=CopierData1)

My task was to obtain a

  • 90% confidence interval for the mean response given V2=6 and
  • 90% prediction interval when V2=6.

I used the following code:

X6 <- data.frame(V2=6)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90)

I got (87.3, 91.9) and (74.5, 104.8), which seems correct since the PI should be wider.

The output for both also included se.fit = 1.39, which was the same. I don't understand what this standard error is. Shouldn't the standard error be larger for the PI than for the CI? How do I find these two different standard errors in R?

Data:

CopierData1 <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L, 
          4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L, 
          66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L, 
          90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L, 
          61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L, 
          10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L, 
          2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L, 
          2L, 4L, 5L)), .Names = c("V1", "V2"),
          class = "data.frame", row.names = c(NA, -45L))

Answer

When the interval and level arguments are specified, predict.lm can return a confidence interval (CI) or a prediction interval (PI). This answer shows how to obtain a CI and a PI without setting these arguments. There are two ways:

  • use the intermediate result from predict.lm;
  • do everything from scratch.

Knowing how to work both ways gives you a thorough understanding of the prediction procedure.

Note that we will only cover the type = "response" (default) case for predict.lm. Discussion of type = "terms" is beyond the scope of this answer.

I gather your code here to help other readers copy, paste and run it. I also change the variable names so that they have clearer meanings. In addition, I expand newdat to include more than one row, to show that our computations are "vectorized".

dat <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L, 
          4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L, 
          66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L, 
          90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L, 
          61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L, 
          10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L, 
          2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L, 
          2L, 4L, 5L)), .Names = c("V1", "V2"),
          class = "data.frame", row.names = c(NA, -45L))

lmObject <- lm(V1 ~ V2, data = dat)

newdat <- data.frame(V2 = c(6, 7))

The following is the output of predict.lm, to be compared with our manual computations later.

predict(lmObject, newdat, se.fit = TRUE, interval = "confidence", level = 0.90)
#$fit
#        fit       lwr      upr
#1  89.63133  87.28387  91.9788
#2 104.66658 101.95686 107.3763
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

predict(lmObject, newdat, se.fit = TRUE, interval = "prediction", level = 0.90)
#$fit
#        fit      lwr      upr
#1  89.63133 74.46433 104.7983
#2 104.66658 89.43930 119.8939
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

Use the intermediate result from predict.lm

## use `se.fit = TRUE`
z <- predict(lmObject, newdat, se.fit = TRUE)
#$fit
#        1         2 
# 89.63133 104.66658 
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

What is se.fit?

z$se.fit is the standard error of the predicted mean z$fit, used to construct the CI for z$fit. We also need quantiles of the t-distribution with z$df degrees of freedom.

alpha <- 0.90  ## 90%
Qt <- c(-1, 1) * qt((1 - alpha) / 2, z$df, lower.tail = FALSE)
#[1] -1.681071  1.681071

## 90% confidence interval
CI <- z$fit + outer(z$se.fit, Qt)
colnames(CI) <- c("lwr", "upr")
CI
#        lwr      upr
#1  87.28387  91.9788
#2 101.95686 107.3763

We see that this agrees with predict.lm(, interval = "confidence").

What is the standard error for the PI?

The PI is wider than the CI, as it accounts for the residual variance:

variance_of_PI = variance_of_CI + variance_of_residual

Note that this is defined point-wise. For a non-weighted linear regression (as in your example), the residual variance is equal everywhere (known as homoscedasticity), and it is z$residual.scale ^ 2. Thus the standard error for the PI is

se.PI <- sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2)
#       1        2 
#9.022228 9.058082 

and the PI is constructed as

PI <- z$fit + outer(se.PI, Qt)
colnames(PI) <- c("lwr", "upr")
PI
#       lwr      upr
#1 74.46433 104.7983
#2 89.43930 119.8939

We see that this agrees with predict.lm(, interval = "prediction").

Remark

Things are more complicated for a weighted linear regression, where the residual variance is not equal everywhere, so z$residual.scale ^ 2 should be weighted. It is easier to construct the PI for fitted values (that is, when you don't set newdata while using interval = "prediction" in predict.lm), because the weights are known (you must have provided them via the weights argument when calling lm). For out-of-sample prediction (that is, when you pass newdata to predict.lm), predict.lm expects you to tell it how the residual variance should be weighted. You need to use either the pred.var or the weights argument of predict.lm, otherwise you get a warning from predict.lm complaining about insufficient information for constructing the PI (a short sketch follows at the end of this remark). The following is quoted from ?predict.lm:

 The prediction intervals are for a single observation at each case
 in ‘newdata’ (or by default, the data used for the fit) with error
 variance(s) ‘pred.var’.  This can be a multiple of ‘res.var’, the
 estimated value of sigma^2: the default is to assume that future
 observations have the same error variance as those used for
 fitting.  If ‘weights’ is supplied, the inverse of this is used as
 a scale factor.  For a weighted fit, if the prediction is for the
 original data frame, ‘weights’ defaults to the weights used for
 the model fit, with a warning since it might not be the intended
 result.  If the fit was weighted and ‘newdata’ is given, the
 default is to assume constant prediction variance, with a warning.

Note that construction of the CI is not affected by the type of regression.
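
As an illustration (not part of the original answer), here is a hedged sketch of how the weights / pred.var arguments of predict.lm might be supplied for a weighted fit. The weight vector w and the weights given for the new rows are made up purely for demonstration.

w <- seq_len(nrow(dat))                      ## hypothetical fitting weights
wls <- lm(V1 ~ V2, data = dat, weights = w)

## in-sample PI: weights default to those used in the fit (with a warning, per ?predict.lm)
predict(wls, interval = "prediction", level = 0.90)

## out-of-sample PI: say how the residual variance should be weighted,
## either via `weights` for the new observations ...
predict(wls, newdat, interval = "prediction", level = 0.90, weights = c(11, 12))

## ... or equivalently via `pred.var`, which defaults to res.var / weights
predict(wls, newdat, interval = "prediction", level = 0.90,
        pred.var = summary(wls)$sigma ^ 2 / c(11, 12))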

Do everything from scratch

Basically we want to know how to obtain fit, se.fit, df and residual.scale in z.

The predicted mean can be computed as the matrix-vector product Xp %*% b, where Xp is the linear predictor matrix and b is the vector of regression coefficients.

Xp <- model.matrix(delete.response(terms(lmObject)), newdat)
b <- coef(lmObject)
yh <- c(Xp %*% b)  ## c() reshapes the single-column matrix to a vector
#[1]  89.63133 104.66658

We see that this agrees with z$fit. The variance-covariance matrix of yh is Xp %*% V %*% t(Xp), where V is the variance-covariance matrix of b, which can be computed by

V <- vcov(lmObject)  ## use `vcov` function in R
#             (Intercept)         V2
# (Intercept)    7.862086 -1.1927966
# V2            -1.192797  0.2333733

The full variance-covariance matrix of yh is not needed for computing the point-wise CI or PI; we only need its main diagonal. So instead of doing diag(Xp %*% V %*% t(Xp)), we can compute it more efficiently via

var.fit <- rowSums((Xp %*% V) * Xp)  ## point-wise variance for predicted mean
#       1        2 
#1.949963 2.598222 

sqrt(var.fit)  ## this agrees with `z$se.fit`
#       1        2 
#1.396411 1.611900 

The residual degrees of freedom are readily available from the fitted model:

dof <- df.residual(lmObject)
#[1] 43

Finally, to compute the residual variance, use the Pearson estimator:

sig2 <- c(crossprod(lmObject$residuals)) / dof
# [1] 79.45063

sqrt(sig2)  ## this agrees with `z$residual.scale`
#[1] 8.913508

Remark

Note that in the case of weighted regression, sig2 should be computed as

sig2 <- c(crossprod(sqrt(lmObject$weights) * lmObject$residuals)) / dof
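
As a hedged check (not in the original answer), this weighted Pearson estimator should match the residual standard error that R reports for a weighted fit; the weight vector w below is again made up purely for illustration.

w <- seq_len(nrow(dat))                      ## hypothetical fitting weights
wls <- lm(V1 ~ V2, data = dat, weights = w)
sig2w <- c(crossprod(sqrt(wls$weights) * wls$residuals)) / df.residual(wls)
all.equal(sqrt(sig2w), summary(wls)$sigma)   ## should be TRUE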


Appendix: a self-written function that mimics predict.lm

The code in "Do everything from scratch" has been cleanly organized into a function lm_predict in this Q & A: linear model with lm: how to get prediction variance of sum of predicted values.
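
For convenience, here is a minimal sketch of such a function for the unweighted case, assembled from the steps above. The name lm_predict_sketch is made up here; the actual lm_predict in the linked Q & A is more general.

lm_predict_sketch <- function(lmObject, newdata, alpha = 0.90) {
  ## linear predictor matrix and predicted mean
  Xp <- model.matrix(delete.response(terms(lmObject)), newdata)
  yh <- c(Xp %*% coef(lmObject))
  ## point-wise variance of the predicted mean
  var.fit <- rowSums((Xp %*% vcov(lmObject)) * Xp)
  ## residual degrees of freedom and Pearson estimate of sigma^2
  dof <- df.residual(lmObject)
  sig2 <- c(crossprod(residuals(lmObject))) / dof
  ## t quantile for an alpha * 100% interval
  Qt <- qt((1 - alpha) / 2, dof, lower.tail = FALSE)
  list(fit = yh,
       CI = cbind(lwr = yh - Qt * sqrt(var.fit),
                  upr = yh + Qt * sqrt(var.fit)),
       PI = cbind(lwr = yh - Qt * sqrt(var.fit + sig2),
                  upr = yh + Qt * sqrt(var.fit + sig2)))
}

lm_predict_sketch(lmObject, newdat)  ## reproduces the intervals shown above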
