两条geom_smooth()行之间的差异 [英] Difference between two geom_smooth() lines

查看:164
本文介绍了两条geom_smooth()行之间的差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我为我的数据作了一个图,现在我想让geom_smooth()估计的每个x的y的差值。不幸的是,有一个



编辑



提出了两个建议,但我仍然不知道如何计算差异。



第一个建议是从ggplot对象访问数据。我是这样做的

  pb<-ggplot_build(p)
pb [[ data]] [[ 1]]

这种方法行之有效,但数据使用的x值并不相同组。例如,第一组的第一个x值为-3.21318853,但是第二组没有x的-3.21318853,因此,我无法计算两组之间-3.21318853的y之差



第二个建议是查看geom_smooth()中使用的公式。程序包说明说: loess()用于少于1000个观察;否则,mgcv :: gam()与公式= y〜s(x,bs = cs)一起使用。我的N大于60,000,因此默认情况下使用gam。我对gam不熟悉; 考虑到刚才描述的内容,谁能提供一个简短的答案来计算两行之间的差异



R代码

  library( ggplot2)#库ggplot 
set.seed(1)#使示例可复制
n<-5000 #设置样本量
df<-data.frame(x = rnorm(n),g = factor(rep(c(0,1),n / 2)))#生成数据
df $ y<-NA#在df
df $ y [df $ g == 0]中包含y<--df $ x [df $ g == 0] ** 2 + rnorm(sum(df $ g == 0))* 5#组g = 0的y
df $ y [df $ g == 1]< -2 + df $ x [df $ g == 1] ** 2 + rnorm(sum(df $ g == 1))* 5#y对于g = 1(带有拦截2)
ggplot(df,aes(x,y,col = g))+ geom_smooth()+ geom_point (alpha = .1)#作图


解决方案

我在上面的评论中提到,您最好在 ggplot 之外进行此操作,而要使用两个平滑的完整模型来进行计算,从中可以计算出差异的不确定性,等等。

这基本上是



与以下评估一致:该评估显示,具有组级平滑度的模型与具有不同组均值的模型相比,提供的拟合度要好得多,而在 x中仅提供一个通用的平滑度

  r $> m0--gam(y〜g + s(x),data = df,method = REML)

r $> AIC(m0,m)
df AIC
m0 9.68355 30277.93
m 14.70675 30285.02

r $> anova(m0,m,test ='F')
偏差表分析

模型1:y〜g + s(x)
模型2:y〜g + s(x,by = g)
残基Df Resid。 Dev Df偏差F Pr(> F)
1 4990.1 124372
2 4983.9 124298 6.1762 73.591 0.4781 0.8301



包装



我提到的博客文章具有将以上步骤包装为简单函数的功能, smooth_diff( )

  smooth_diff<-函数(model,newdata,f1,f2,var,alpha = 0.05,
无条件= FALSE){
xp<-预测(model,newdata = newdata,type ='lpmatrix')
c1<-grepl(f1,colnames(xp) )
c2<-grepl(f2,colnames(xp))
r1<-newdata [[var]] == f1
r2<-newdata [[var]] = = f2
##来自比较
的数据xp的差异行X <--xp [r1,]-xp [r2,]
##将与样条线相关的X的cols归零其他海湾
X [,! (c1 | c2)]<-0
##将参数cols
X [,!grepl('^ s\\(',colnames(xp))]零清零- 0
dif<-X%*%coef(model)
se<-sqrt(rowSums((X%*%vcov(model,无条件=无条件))* X))
crit<-qt(alpha / 2,df.residual(model),lower.tail = FALSE)
upr<-dif +(crit * se)
lwr<-dif-( crit * se)
data.frame(pair = paste(f1,f2,sep ='-'),
diff = dif,
se = se,
upper = upr ,
lower = lwr)
}

使用此功能,我们可以重复整个分析并绘制以下差异:

  out<-smooth_diff(m,pdat,'0','1', 'g')
out<-cbind(x = with(df,seq(min(x),max(x),length = 200)),
out)

ggplot(out,aes(x = x,y = diff))+
geom_ribbon(aes(ymin = lower,ymax = upper,x = x),alpha = 0.2)+
geom_line( )

在这里我不会显示该图,因为它与上面的图相同,除了轴标签。


I made a plot for my data and am now I would like to have the difference in y for every x that was estimated by geom_smooth(). There is a similiar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):

EDIT

Two suggestions were made but I still don't know how to calculate the differences.

First suggestion was to access the data from the ggplot object. I did so with

pb <- ggplot_build(p)
pb[["data"]][[1]]

That approach kind of works, but the data doesn't use the same x values for the groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group, hence, I can not calculate the difference in y for -3.21318853 between both groups

Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, hence, gam is used by default. I am not familiar with gam; can anyone provide a short answer how to calculate the difference between the two lines considering the things just described?

R Code

library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot

解决方案

As I mentioned in the comments above, you really are better off doing this outside of ggplot and instead do it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.

This is basically a short version of a blog post that I wrote a year or so back.

OP's exmaple data

set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)

Start by fitting the model for the example data:

library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")

Here I'm fitting a GAM with a factor-smooth interaction (the by bit) and for this model we need to also include g as a parametric effect as the group-specific smooths are both centred about 0 so we need to include the group means in the parametric part of the model.

Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:

pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
                            g = c(0,1)))
pdat <- transform(pdat, g = factor(g))

then we use this prediction data to generate the Xp matrix, which is a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:

xp <- predict(m, newdata = pdat, type = "lpmatrix")

Next some code to identify which rows and columns in xp belong to the smooths for the respective levels of g; as there are only two levels and only a single smooth term in the model, this is entirely trivial but for more complex models this is needed and it is important to get the smooth component names right for the grep() bits to work.

## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)

Now we can difference the rows of xp for the pair of levels we are comparing

## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]

As we focus on the difference, we need to zero out all the column not associated with the selected pair of smooths, which includes any parametric terms.

## zero out cols of X related to splines for other lochs
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0

(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)

Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:

## difference between smooths
dif <- X %*% coef(m)

Now dif contains the difference between the two smooths.

We can use X again and covariance matrix of the model coefficients to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimate difference.

## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))

## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)

Note that here with the vcov() call we're using the empirical Bayesian covariance matrix but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via argument unconditional = TRUE.

Finally we gather the results and plot:

res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
                  dif = dif, upr = upr, lwr = lwr)

ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()

This produces

Which is consistent with an assessment that shows the model with the group-level smooths doesn't provide substantially better fit than a model with different group means but only single common smoother in x:

r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")

r$> AIC(m0, m)
         df      AIC
m0  9.68355 30277.93
m  14.70675 30285.02

r$> anova(m0, m, test = 'F')
Analysis of Deviance Table

Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
  Resid. Df Resid. Dev     Df Deviance      F Pr(>F)
1    4990.1     124372                              
2    4983.9     124298 6.1762   73.591 0.4781 0.8301

Wrapping up

The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():

smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
                        unconditional = FALSE) {
    xp <- predict(model, newdata = newdata, type = 'lpmatrix')
    c1 <- grepl(f1, colnames(xp))
    c2 <- grepl(f2, colnames(xp))
    r1 <- newdata[[var]] == f1
    r2 <- newdata[[var]] == f2
    ## difference rows of xp for data from comparison
    X <- xp[r1, ] - xp[r2, ]
    ## zero out cols of X related to splines for other lochs
    X[, ! (c1 | c2)] <- 0
    ## zero out the parametric cols
    X[, !grepl('^s\\(', colnames(xp))] <- 0
    dif <- X %*% coef(model)
    se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
    crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
    upr <- dif + (crit * se)
    lwr <- dif - (crit * se)
    data.frame(pair = paste(f1, f2, sep = '-'),
               diff = dif,
               se = se,
               upper = upr,
               lower = lwr)
}

Using this function we can repeat the entire analysis and plot the difference with:

out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)

ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()

I won't show the plot here as it is identical to that shown above except for the axis labels.

这篇关于两条geom_smooth()行之间的差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆