stat_smooth gam与gam {mgcv}不同 [英] stat_smooth gam not the same as gam {mgcv}

查看:92
本文介绍了stat_smooth gam与gam {mgcv}不同的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在ggplot2中使用stat_smooth函数,决定我想要拟合优度",并为此使用了mgvc gam.我想到我应该检查以确保它们是相同的模型(stat_smooth与mgvc gam),因此我使用下面的代码进行检查.似乎它们有不同的结果,如该图所证明(

您始终可以使用 ggplot_build 来查看绘图对象,并查看构成它的所有部分(我不在这里显示结果,因为它占用了很多空间,但是输出将打印到您的控制台).

ggplot_build(p1)

stat_smooth 的预测数据集是数据集列表中的第二个.

ggplot_build(p1)$ data [[2]]

但是看看该数据集有多少行:

  nrow(ggplot_build(p1)$ data [[2]])[1] 80 

n 参数的默认设置为80,但是数据集中有365行.那么,如果将 n 更改为365,会发生什么呢?我将使平滑线更胖,以便您实际上可以看到它(蓝色).

 (p2 = ggplot(dat,aes(x,y))+geom_point()+geom_smooth(方法="gam",公式= y〜s(x,k = 100),n = 365,大小= 2)+geom_line(aes(y = predgam))) 

  nrow(ggplot_build(p2)$ data [[2]])[1] 365 

如果查看 stat_smooth 帮助页面的详细信息"部分中提到的 predictdf 函数的代码,您会发现在以下情况下未使用原始数据集:做出预测.取而代之的是从原始的解释变量中得出一个序列.当使用小的数据集时,这是非常重要的,并且您需要更多的预测点才能使线看起来平滑.但是,在您的情况下,原始数据集已经是 x 的良好平滑序列,因此使用 n = 365 可以从 stat_smooth 中获得相同的预测原始数据集确实如此.

您可以看到 predictdf 的代码Plot: stat_smoother gam (red), mgcv gam (black)). However, I don't know why they have different results. Is it that some default parameter is different between the two? Is is that gam is being run on a numeric x and stat_smooth is being run with a POSIXct x (if so - I don't know what to do about that)? It looks like stat_smooth is smoother, but the k values are the same...

I think there are several posts on how to plot gam outputs in ggplot2, but I'd really like to know why stat_smooth and mgcv are giving different results in the first place. I am very new to GAM (and R), so it's quite possible I'm missing something easy. However, I did google and search this forum before asking.

My data is a bit big to easily share, so I used a sample dataset - I've put the source in the code, as well as a dput() below everything, and my sessionInfo() after that.

I have tried to make a quality question, but it is only my second one. Ever. So, constructive criticism is appreciated.

Thank you!

library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)

stackOF_data <- read_excel("mean-daily-flow-cumecs-vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]

a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a) 
a2=data.table(gam_mdf= predict(a1,a))
a2=cbind(timeseries=stackOF_data$timeseries,a2)

# see if predict and actual are the same
p <- ggplot() + 
geom_line(data = a2, aes(x = timeseries, y = gam_mdf), size=1)+
scale_color_manual(values=c("black","magenta"))+
scale_y_continuous()+
scale_x_datetime(labels = date_format("%Y-%m-%d"), breaks = "1 month", minor_breaks = "1 week")+ 
theme(axis.text.x=element_text(angle=50, size=10,hjust=1))+
stat_smooth(data = stackOF_data, aes(x = (timeseries), y = mdf),method="gam", formula=y~s(x,k=100, bs="cs"), col="red", se=FALSE, size=1)
p

# data from: https://datamarket.com/data/set/235m/mean-daily-flow-cumecs-vatnsdalsa-river-1-jan-1972-31-dec-1974#!ds=235m&display=line&s=14l

> dput(a)
structure(list(x = c(126230400, 126316800, 126403200, 126489600, 
126576000, 126662400, 126748800, 126835200, 126921600, 127008000, 
127094400, 127180800, 127267200, 127353600, 127440000, 127526400, 
127612800, 127699200, 127785600, 127872000, 127958400, 128044800, 
128131200, 128217600, 128304000, 128390400, 128476800, 128563200, 
128649600, 128736000, 128822400, 128908800, 128995200, 129081600, 
129168000, 129254400, 129340800, 129427200, 129513600, 129600000, 
129686400, 129772800, 129859200, 129945600, 130032000, 130118400, 
130204800, 130291200, 130377600, 130464000, 130550400, 130636800, 
130723200, 130809600, 130896000, 130982400, 131068800, 131155200, 
131241600, 131328000, 131414400, 131500800, 131587200, 131673600, 
131760000, 131846400, 131932800, 132019200, 132105600, 132192000, 
132278400, 132364800, 132451200, 132537600, 132624000, 132710400, 
132796800, 132883200, 132969600, 133056000, 133142400, 133228800, 
133315200, 133401600, 133488000, 133574400, 133660800, 133747200, 
133833600, 133920000, 134006400, 134092800, 134179200, 134265600, 
134352000, 134438400, 134524800, 134611200, 134697600, 134784000, 
134870400, 134956800, 135043200, 135129600, 135216000, 135302400, 
135388800, 135475200, 135561600, 135648000, 135734400, 135820800, 
135907200, 135993600, 136080000, 136166400, 136252800, 136339200, 
136425600, 136512000, 136598400, 136684800, 136771200, 136857600, 
136944000, 137030400, 137116800, 137203200, 137289600, 137376000, 
137462400, 137548800, 137635200, 137721600, 137808000, 137894400, 
137980800, 138067200, 138153600, 138240000, 138326400, 138412800, 
138499200, 138585600, 138672000, 138758400, 138844800, 138931200, 
139017600, 139104000, 139190400, 139276800, 139363200, 139449600, 
139536000, 139622400, 139708800, 139795200, 139881600, 139968000, 
140054400, 140140800, 140227200, 140313600, 140400000, 140486400, 
140572800, 140659200, 140745600, 140832000, 140918400, 141004800, 
141091200, 141177600, 141264000, 141350400, 141436800, 141523200, 
141609600, 141696000, 141782400, 141868800, 141955200, 142041600, 
142128000, 142214400, 142300800, 142387200, 142473600, 142560000, 
142646400, 142732800, 142819200, 142905600, 142992000, 143078400, 
143164800, 143251200, 143337600, 143424000, 143510400, 143596800, 
143683200, 143769600, 143856000, 143942400, 144028800, 144115200, 
144201600, 144288000, 144374400, 144460800, 144547200, 144633600, 
144720000, 144806400, 144892800, 144979200, 145065600, 145152000, 
145238400, 145324800, 145411200, 145497600, 145584000, 145670400, 
145756800, 145843200, 145929600, 146016000, 146102400, 146188800, 
146275200, 146361600, 146448000, 146534400, 146620800, 146707200, 
146793600, 146880000, 146966400, 147052800, 147139200, 147225600, 
147312000, 147398400, 147484800, 147571200, 147657600, 147744000, 
147830400, 147916800, 148003200, 148089600, 148176000, 148262400, 
148348800, 148435200, 148521600, 148608000, 148694400, 148780800, 
148867200, 148953600, 149040000, 149126400, 149212800, 149299200, 
149385600, 149472000, 149558400, 149644800, 149731200, 149817600, 
149904000, 149990400, 150076800, 150163200, 150249600, 150336000, 
150422400, 150508800, 150595200, 150681600, 150768000, 150854400, 
150940800, 151027200, 151113600, 151200000, 151286400, 151372800, 
151459200, 151545600, 151632000, 151718400, 151804800, 151891200, 
151977600, 152064000, 152150400, 152236800, 152323200, 152409600, 
152496000, 152582400, 152668800, 152755200, 152841600, 152928000, 
153014400, 153100800, 153187200, 153273600, 153360000, 153446400, 
153532800, 153619200, 153705600, 153792000, 153878400, 153964800, 
154051200, 154137600, 154224000, 154310400, 154396800, 154483200, 
154569600, 154656000, 154742400, 154828800, 154915200, 155001600, 
155088000, 155174400, 155260800, 155347200, 155433600, 155520000, 
155606400, 155692800, 155779200, 155865600, 155952000, 156038400, 
156124800, 156211200, 156297600, 156384000, 156470400, 156556800, 
156643200, 156729600, 156816000, 156902400, 156988800, 157075200, 
157161600, 157248000, 157334400, 157420800, 157507200, 157593600, 
157680000), y = c(4.65, 4.65, 4.65, 4.48, 5.16, 5.52, 5.34, 5.34, 
4.82, 4.65, 4.48, 4.31, 4.31, 4.31, 4.14, 3.82, 3.98, 3.98, 4.31, 
5.71, 6.5, 6.3, 5.71, 5.71, 5.16, 4.65, 4.14, 3.98, 4.48, 4.48, 
4.31, 4.65, 4.31, 3.98, 3.98, 3.98, 3.98, 3.98, 3.98, 3.82, 3.67, 
3.67, 3.98, 3.98, 3.82, 3.82, 3.82, 4.14, 5.9, 4.48, 3.98, 3.98, 
3.82, 3.67, 3.67, 3.67, 4.65, 3.98, 4.31, 4.31, 3.67, 4.31, 6.1, 
7.3, 7.5, 7.5, 8.14, 10.8, 16.1, 14.8, 12.5, 9.9, 8.14, 6.9, 
6.1, 5.34, 5.16, 4.99, 4.99, 4.99, 4.99, 5.52, 6.3, 7.3, 6.9, 
5.9, 5.71, 5.71, 8.58, 31.5, 33.7, 18.4, 11.3, 16.1, 32.9, 45.3, 
54, 25.7, 18, 15.9, 15.6, 14.5, 15.9, 35.9, 37.5, 29.4, 27.5, 
30.1, 27.5, 30.8, 29.4, 22, 20.1, 35.9, 36.7, 32.9, 22, 18, 15.9, 
15.2, 14.8, 13, 12.7, 12.5, 11, 9.68, 8.8, 7.92, 7.3, 6.9, 7.3, 
10.3, 11, 11.3, 11.9, 12.5, 13.6, 12.2, 10.8, 9.9, 9.46, 8.8, 
7.5, 7.1, 7.71, 7.1, 6.1, 5.34, 5.34, 5.34, 5.52, 5.52, 6.3, 
6.7, 6.5, 5.9, 5.71, 5.9, 5.71, 5.52, 7.3, 7.5, 7.1, 7.3, 6.7, 
6.9, 7.3, 7.5, 10.8, 11.6, 8.58, 7.92, 7.1, 6.7, 6.5, 6.1, 5.9, 
5.9, 5.71, 5.52, 5.52, 5.52, 5.9, 5.9, 5.71, 5.52, 5.52, 5.34, 
5.34, 5.52, 6.5, 6.5, 5.71, 5.34, 5.16, 4.99, 4.82, 4.82, 4.99, 
4.82, 4.82, 4.82, 4.82, 4.82, 4.65, 4.48, 4.48, 4.31, 4.31, 4.14, 
4.14, 4.31, 4.48, 4.31, 4.31, 4.31, 4.99, 5.71, 6.3, 6.1, 6.1, 
5.9, 5.71, 5.52, 5.52, 5.52, 5.52, 5.52, 5.34, 5.34, 5.52, 5.52, 
5.52, 5.34, 5.34, 5.52, 5.34, 5.52, 5.52, 5.34, 5.34, 5.34, 5.34, 
5.71, 5.9, 6.3, 6.9, 7.5, 6.5, 6.1, 6.1, 5.9, 6.1, 6.1, 5.9, 
6.5, 6.5, 6.1, 5.9, 5.9, 5.71, 5.9, 5.9, 5.71, 4.99, 4.65, 5.16, 
5.34, 5.34, 4.65, 4.99, 5.71, 5.34, 5.34, 5.34, 5.34, 4.99, 5.34, 
5.34, 5.34, 5.34, 5.52, 5.34, 5.52, 5.71, 6.5, 7.71, 6.9, 6.5, 
6.7, 6.1, 5.9, 6.1, 5.9, 5.71, 7.92, 7.71, 7.1, 7.92, 5.34, 5.16, 
8.14, 10.1, 7.92, 7.3, 6.9, 6.9, 6.9, 8.58, 7.3, 6.9, 7.3, 6.3, 
5.16, 6.1, 5.52, 4.99, 5.34, 5.34, 5.34, 5.16, 5.71, 5.52, 5.52, 
5.16, 4.82, 5.52, 6.1, 5.9, 5.71, 5.52, 5.16, 4.99, 4.48, 4.82, 
5.16, 5.16, 5.16, 5.16, 5.16, 4.82, 4.65, 3.82, 4.14, 4.65, 4.65, 
4.31, 4.31, 5.16, 5.16, 5.16, 5.16, 5.16, 4.99, 4.65, 5.16, 5.16, 
5.16, 5.16, 5.16, 5.16, 5.16, 5.16, 5.34, 5.34)), .Names = c("x", 
"y"), row.names = c(NA, -365L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000000005860788>)

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United         States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 readxl_0.1.0     mgcv_1.8-7       nlme_3.1-121         scales_0.3.0     sos_1.3-8        brew_1.0-6       ggplot2_1.0.1   
[9] MASS_7.3-43     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1      lattice_0.20-33  digest_0.6.8     chron_2.3-47     grid_3.2.2       plyr_1.8.3       gtable_0.1.2     magrittr_1.5    
 [9] stringi_0.5-5    reshape2_1.4.1   Matrix_1.2-2     labeling_0.3     proto_0.3-10     tools_3.2.2      stringr_1.0.0    munsell_0.4.2   
[17] colorspace_1.2-6

Partial Solution

I still don't really know why the two methods are giving different answers, and that bothers me. However, after much internet searching, I did find the following workaround:

library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)

stackOF_data <- read_excel("C:/Users/jel4049/Desktop/mean-daily-flow-cumecs-    vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)

stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a) 
a2=data.table(gam_mdf = predict(a1,a))

preds <- predict(a1,se.fit=TRUE)
my_data <- data.frame(mu=preds$fit, low =(preds$fit - 1.96 * preds$se.fit), high = (preds$fit + 1.96 * preds$se.fit))


m <- ggplot()+
  geom_line(data = a2, aes(x=stackOF_data$timeseries, y=gam_mdf), size=1, col="blue")+
  geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=stackOF_data$timeseries, y = mu), stat = "identity", col="green")
m

Now at least I know that the summary and data fit quality info I can get from some of the mgcv functions match my plots.

解决方案

It turns out the differences you were seeing was because you were using the default for the n argument in stat_smooth.

From the help page:

n number of points to evaluate smoother at

Of course, it didn't jump out at me right away that this meant n controls the size of the dataset for the newdata argument in predict and therefore stat_smooth doesn't use the original dataset when making the predictions. But I was reading this nice answer on a different stat_smooth question and realized that to figure out what was going on I should take a closer look at the stat_smooth predictions vs manual predictions from a fitted gam model.

So, using your dataset from your OP, which I named dat, we can check what's going on.

The plot when k = 100, after fitting the model via gam and adding the predictions to the dataset. As you noted, the blue (stat_smooth) and black (manual predictions) don't match.

dat$predgam = predict(gam(y ~ s(x, k = 100), data = dat))

(p1 = ggplot(dat, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "gam", formula = y ~ s(x, k = 100)) +
    geom_line(aes(y = predgam)))

You can always use ggplot_build to look at your plot object and see all the pieces that make it up (I'm not showing the results here because it takes up so much space, but the output will print to your Console).

ggplot_build(p1)

The prediction dataset for stat_smooth is the second in the list of datasets.

ggplot_build(p1)$data[[2]]

But look how many rows that dataset has:

nrow(ggplot_build(p1)$data[[2]])
[1] 80

The default setting for the n argument is 80, but you have 365 rows in your dataset. So what happens if you change n to 365? I'll make the smooth line fatter so you can actually see it (in blue).

(p2 = ggplot(dat, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "gam", formula = y ~ s(x, k = 100), n = 365, size = 2) +
    geom_line(aes(y = predgam)))

nrow(ggplot_build(p2)$data[[2]])
[1] 365

If you look at the code for the predictdf function mentioned in the Details section of the stat_smooth help page you'll see that the original dataset isn't used when making predictions. Instead, a sequence is made from the original explanatory variable. This is something that can be really important when working with a small dataset and you need more prediction points in order for the line to look smooth. In your case, though, the original dataset is already a nice smooth sequence of x so using n = 365 gets the same predictions from stat_smooth as the original dataset does.

You can see the code for predictdf here.

这篇关于stat_smooth gam与gam {mgcv}不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆