predict.lm() in R - how to get nonconstant prediction bands around fitted values
Question
I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function, and I do not like using functions without knowing what is happening. I found several how-tos on this subject, but only with the corresponding R code and no real explanation. This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)
Now, here is what I have trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. To calculate the confidence interval I obviously need the data the interval is for (the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?
2) interval
Type of interval calculation.
Okay, but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type = "terms", which terms (default is all terms).
3a: Can I use this to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it would not make sense to do it again in 3b, so I guess I am wrong again, but I cannot figure out why.
I guess some of you might think: why not just try this out? I would (even if it might not solve everything here), but right now I don't know how to. As I do not know what newdata is for, I don't know how to use it, and when I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
I want to add that my intention is to understand how predict.lm works. By that I mean I don't understand whether it works the way I think it does: that is, it calculates y-hat (the predicted values) and then adds/subtracts a margin to get the upr/lwr bounds of the interval at several data points (which then look like a confidence line)? Then I would understand why it is necessary to have the same length in the newdata as in the linear model.
Recommended answer
Make up some data:
d <- data.frame(x=c(1,4,5,7), y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
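For orientation, here is a minimal sketch of what these two calls return. It rebuilds the same toy data so that it runs on its own; nothing beyond base R is assumed. With an interval requested, predict() returns a matrix with one row per original observation and the columns fit, lwr and upr.

```r
d <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

p_conf1 <- predict(lm1, interval = "confidence")
# Without newdata, R warns that prediction intervals on the current data
# refer to future responses at the observed x values.
p_pred1 <- predict(lm1, interval = "prediction")

colnames(p_conf1)          # "fit" "lwr" "upr"
nrow(p_conf1) == nrow(d)   # one row per original observation

# The fit column is just the fitted values of the model.
all.equal(unname(p_conf1[, "fit"]), unname(fitted(lm1)))
```

Both matrices share the same fit column; only the lwr and upr bounds differ, with the prediction bounds lying outside the confidence bounds.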
Confidence and prediction intervals with new x values (extrapolated beyond, and more finely/evenly spaced than, the original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
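On the question of what newdata is for, the following sketch (same toy data, rebuilt so the block is self-contained) illustrates two points: newdata only supplies the x values at which to evaluate the fitted model, so its number of rows need not match the original data, and passing the original x values as newdata reproduces the default result.

```r
d <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

# newdata needs a column named like the predictor in the formula (here: x);
# its number of rows is independent of the size of the original data.
nd <- data.frame(x = seq(0, 8, length.out = 51))
p_conf2 <- predict(lm1, interval = "confidence", newdata = nd)
dim(p_conf2)   # 51 rows, 3 columns (fit, lwr, upr)

# Passing the original x values as newdata reproduces the default result.
same <- predict(lm1, interval = "confidence", newdata = d["x"])
all.equal(unname(same), unname(predict(lm1, interval = "confidence")))
```

The sample size, the mean of x and the residual variance needed for the interval all come from the fitted object lm1, not from newdata.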
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
## intervals at the original x values (red)
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
## intervals at the new x values (blue)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values.
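To see why the bands are curved rather than a constant offset from the fitted line, the intervals can be rebuilt by hand. This is a sketch under the usual assumptions of the linear model (independent normal errors with constant variance, no weights), using the se.fit = TRUE output of predict(); the variable names tq, pred_se and so on are chosen here for illustration.

```r
d <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)
nd <- data.frame(x = seq(0, 8, length.out = 51))

# se.fit = TRUE additionally returns the standard error of the fitted mean,
# the residual degrees of freedom and the residual standard deviation.
p <- predict(lm1, newdata = nd, se.fit = TRUE)
tq <- qt(0.975, df = p$df)   # t quantile for level = 0.95

# Confidence band: uncertainty of the fitted mean only.
conf_lwr <- p$fit - tq * p$se.fit
conf_upr <- p$fit + tq * p$se.fit

# Prediction band: adds the residual variance of a single new observation.
pred_se  <- sqrt(p$se.fit^2 + p$residual.scale^2)
pred_lwr <- p$fit - tq * pred_se
pred_upr <- p$fit + tq * pred_se

# Compare with what predict() computes itself.
p_conf2 <- predict(lm1, newdata = nd, interval = "confidence")
p_pred2 <- predict(lm1, newdata = nd, interval = "prediction")
all.equal(unname(conf_lwr), unname(p_conf2[, "lwr"]))
all.equal(unname(pred_upr), unname(p_pred2[, "upr"]))

# The band is narrowest near mean(d$x) and widens away from it.
width <- p_conf2[, "upr"] - p_conf2[, "lwr"]
nd$x[which.min(width)]
```

Because se.fit depends on how far each x value lies from the mean of the original x values, the margin added to and subtracted from y-hat differs at every point, which is what makes the bands nonconstant.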
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
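A short sketch of both options on the same toy data (rebuilt here so the block runs on its own). With only one predictor the type="terms" output has a single column, so this is meant to show the shape of the result rather than a realistic use.

```r
d <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

# Confidence intervals for the coefficients (intercept and slope).
ci <- confint(lm1, level = 0.95)
ci

# Per-term contributions to the prediction; terms are centered, and the
# constant needed to get back to the fitted values is stored as an attribute.
tt <- predict(lm1, type = "terms")
rowSums(tt) + attr(tt, "constant")   # equals fitted(lm1)
```

In a model with several predictors, the terms argument then selects which of these columns to return.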
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
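As a quick illustration of the difference in return values (same toy data; the three x values in nd are arbitrary choices for this sketch):

```r
d <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)
nd <- data.frame(x = c(0, 4, 8))

p_none <- predict(lm1, newdata = nd)   # interval = "none" by default
p_conf <- predict(lm1, newdata = nd, interval = "confidence")

is.matrix(p_none)   # FALSE: a plain named numeric vector
is.matrix(p_conf)   # TRUE: columns fit, lwr, upr

# The point predictions are identical either way.
all.equal(unname(p_none), unname(p_conf[, "fit"]))
```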