具有lm()的线性回归:汇总预测值的预测间隔 [英] Linear regression with `lm()`: prediction interval for aggregated predicted values

查看:94
本文介绍了具有lm()的线性回归:汇总预测值的预测间隔的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用predict.lm(fit, newdata=newdata, interval="prediction")来获取预测及其对新观测值的预测间隔(PI).现在,我想基于一个附加变量(即,单个家庭的预测的邮政编码级别上的空间聚合)汇总(总和和均值)这些预测及其PI.

I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PI) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PI's based on an additional variable (i.e. a spatial aggregation on the zip code level of predictions for single households).

我从StackExchange学习了,您不能仅通过汇总预测间隔的限制来汇总单个预测的预测间隔.这篇文章对理解为什么不能做到这一点非常有帮助,但是我很难把这一点翻译成实际的代码.答案是:

I learned from StackExchange, that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of the prediction intervals. The post is very helpful to understand why this can't be done, but I have a hard time translating this bit into actual code. The answer reads:

这是一个可复制的示例:

Here's a reproducible example:

library(dplyr)
set.seed(123)

data(iris)

#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]

#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)

#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)

#Predict Pedal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")

# Aggregate data by summing predictions for species
#NOT correct for prediction intervals
predictions_agg1<-data.frame(predictions1,Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum,mean))

predictions_agg2<-data.frame(predictions2,Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum,mean))

我找不到一个好的教程或软件包来描述如何在使用predict.lm()时正确地将预测及其PI的PI汇总在一起.那里有东西吗?如果您能为我指出如何在R中执行此操作的正确方向,将不胜感激.

I couldn't find a good tutorial or package which describes how to properly aggregate predictions and their PI's in R when using predict.lm(). Is there something out there? Would highly appreciate if you could point me in the right direction on how to do this in R.

推荐答案

您的问题与我2年前回答的线程密切相关:具有lm的线性模型:如何获取预测值总和的预测方差.它提供了 Glen_b关于交叉验证的答案的R实现.感谢您引用交叉验证线程;我不知道也许我可以在此处留下链接堆栈溢出线程的注释.

Your question is closely related to a thread I answered 2 years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know it; perhaps I can leave a comment there linking the Stack Overflow thread.

我已经完善了我的原始答案,将逐行代码干净地包装到易于使用的函数lm_predictagg_pred中.然后,将解决问题的过程简化为按组应用这些功能.

I have polished my original answer, wrapping up line-by-line code cleanly into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.

考虑问题中的iris示例,并使用第二个模型fit2进行演示.

Consider the iris example in your question, and the second model fit2 for demonstration.

set.seed(123)
data(iris)

#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]

#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)

我们将pred按组Species划分,然后在所有子数据帧上应用lm_predict(带有diag = FALSE).

We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.

oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)

要使用agg_pred,我们需要指定一个权重向量,其长度等于数据数量.我们可以通过查询每个oo[[i]]fit的长度来确定:

To use agg_pred we need to specify a weight vector, whose length equals to the number of data. We can determine this by consulting the length of fit in each oo[[i]]:

n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor  virginica 
#    11         13         14 

如果聚合操作是总和,我们会这样做

If aggregation operation is sum, we do

w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa    : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...

SUM <- Map(agg_pred, w, oo)
SUM[[1]]  ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
#   lower    upper 
#1.792908 3.206549 
#
#$PI
#   lower    upper 
#0.999764 3.999693 

sapply(SUM, "[[", "CI")  ## some nice presentation for CI, for example
#        setosa versicolor virginica
#lower 1.792908   16.41526  26.55839
#upper 3.206549   17.63953  28.10812

如果聚合操作平均,我们将w调整为n并调用agg_pred.

If aggregation operation is average, we rescale w by n and call agg_pred.

w <- mapply("/", w, n)
#List of 3
# $ setosa    : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...

AVE <- Map(agg_pred, w, oo)
AVE[[2]]  ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
#    lower    upper 
#1.262712 1.356887 
#
#$PI
#   lower    upper 
#1.189562 1.430037 

sapply(AVE, "[[", "PI")  ## some nice presentation for CI, for example
#          setosa versicolor virginica
#lower 0.09088764   1.189562  1.832255
#upper 0.36360845   1.430037  2.072496


太好了!太感谢了!我忘了提到一件事:在我的实际应用中,我需要汇总约300,000个预测,这将创建一个完整的方差-协方差矩阵,其大小约为700GB.您是否知道是否有一种计算上更有效的方法可以直接求出方差-协方差矩阵的和?

This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?

使用原始Q&的修订版中提供的fast_agg_pred功能.答:让我们从头开始.

Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.

set.seed(123)
data(iris)

#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]

#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)

## list of new data
newdatlist <- split(pred, pred$Species)

n <- sapply(newdatlist, nrow)
#setosa versicolor  virginica 
#    11         13         14 

如果聚合操作是总和,我们会这样做

If aggregation operation is sum, we do

w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)

如果聚合操作是平均水平,我们会这样做

If aggregation operation is average, we do

w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)

请注意,在这种情况下我们不能使用Map,因为我们需要为fast_agg_pred提供更多参数.在这种情况下,将mapplyMoreArgsSIMPLIFY一起使用.

Note that we can't use Map in this case as we need to provide more arguments to fast_agg_pred. Use mapply in this situation, with MoreArgs and SIMPLIFY.

这篇关于具有lm()的线性回归:汇总预测值的预测间隔的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆