Caret package - cross-validating GAM with both smooth and linear predictors
Problem description
I would like to cross-validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then linear predictors. Typical syntax when using mgcv is:
gam1 <- gam( y ~ s(lat , long) + x1 + x2, family = binomial(logit) )
I'm not quite sure how to specify this model using the train function in caret. This is my syntax more or less:
cv <- train(y ~ lat + long + x1 + x2,
            data = data,
            method = "gam",
            family = "binomial",
            trControl = trainControl(method = "LOOCV", number=1, repeats=),
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))
The problem is that I only want lat and long to be smoothed and x1 and x2 to be treated as linear.
Thanks!
Solution
It is very interesting to see someone using mgcv outside mgcv. After a bit of research, I am here to frustrate you: using mgcv with caret is a bad idea, at least with the current support from caret.
Let me just ask you a few fundamental questions, if you are using caret:
- How can you specify the number of knots, as well as the spline basis class, for a smooth function?
- How can you specify a 2D smooth function?
- How can you specify a tensor product spline with te or ti?
- How can you tweak the smoothing parameters?
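For contrast, here is a minimal sketch of how each of these choices is expressed when mgcv is called directly. The simulated data and the particular values of k, bs and sp are assumptions picked only for illustration; they are not recommendations.

```r
library(mgcv)

set.seed(0)
# Simulated bivariate example shipped with mgcv; keep columns y, x, z
dat <- gamSim(eg = 2, scale = 0.2, verbose = FALSE)$data[1:3]

# Number of knots (k) and basis class (bs) for a univariate smooth
m1 <- gam(y ~ s(x, bs = "cr", k = 20) + z, data = dat)

# Isotropic 2D smooth of x and z
m2 <- gam(y ~ s(x, z, k = 40), data = dat)

# Tensor product smooth with a separate basis dimension per margin
m3 <- gam(y ~ te(x, z, k = c(6, 6)), data = dat)

# Smoothing parameter supplied by hand instead of being estimated
m4 <- gam(y ~ s(x, z), data = dat, sp = 0.01)
```

None of these arguments can be reached through the formula that caret builds for method = "gam".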
If you want to know what caret::train is doing with method = "gam", check out its fitting routine:

getModelInfo(model = "gam", regex = FALSE)$gam$fit

function(x, y, wts, param, lev, last, classProbs, ...) {
    dat <- if(is.data.frame(x)) x else as.data.frame(x)
    modForm <- caret:::smootherFormula(x)
    if(is.factor(y)) {
      dat$.outcome <- ifelse(y == lev[1], 0, 1)
      dist <- binomial()
    } else {
      dat$.outcome <- y
      dist <- gaussian()
    }
    modelArgs <- list(formula = modForm,
                      data = dat,
                      select = param$select,
                      method = as.character(param$method))
    ## Intercept family if passed in
    theDots <- list(...)
    if(!any(names(theDots) == "family")) modelArgs$family <- dist
    modelArgs <- c(modelArgs, theDots)
    out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
    out
}
You see the modForm <- caret:::smootherFormula(x) line? That line is the key, while the other lines are just routine construction of a model call. So, let's check what GAM formula caret is constructing:

caret:::smootherFormula

function (data, smoother = "s", cut = 10, df = 0, span = 0.5,
    degree = 1, y = ".outcome")
{
    nzv <- nearZeroVar(data)
    if (length(nzv) > 0)
        data <- data[, -nzv, drop = FALSE]
    numValues <- sort(apply(data, 2, function(x) length(unique(x))))
    prefix <- rep("", ncol(data))
    suffix <- rep("", ncol(data))
    prefix[numValues > cut] <- paste(smoother, "(", sep = "")
    if (smoother == "s") {
        suffix[numValues > cut] <- if (df == 0)
            ")"
        else paste(", df=", df, ")", sep = "")
    }
    if (smoother == "lo") {
        suffix[numValues > cut] <- paste(", span=", span, ",degree=",
            degree, ")", sep = "")
    }
    if (smoother == "rcs") {
        suffix[numValues > cut] <- ")"
    }
    rhs <- paste(prefix, names(numValues), suffix, sep = "")
    rhs <- paste(rhs, collapse = "+")
    form <- as.formula(paste(y, rhs, sep = "~"))
    form
}
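In short, it creates additive, univariate smooths: every predictor with more than cut (10 by default) distinct values is wrapped in its own s(). This is the classic form in which GAMs were first proposed.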
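To make the rule concrete, here is a simplified base R sketch of the same logic. The function name sketch_smoother_formula and the toy data are hypothetical, and the sketch leaves out the near-zero-variance filter, the sorting, and the df, lo and rcs branches of the real function.

```r
# Simplified illustration (not caret's code): wrap a predictor in s()
# only when it has more than `cut` distinct values.
sketch_smoother_formula <- function(data, cut = 10, y = ".outcome") {
  n_unique <- sapply(data, function(col) length(unique(col)))
  terms <- ifelse(n_unique > cut,
                  paste0("s(", names(n_unique), ")"),
                  names(n_unique))
  as.formula(paste(y, "~", paste(terms, collapse = " + ")))
}

set.seed(1)
toy <- data.frame(lat  = runif(50),      # continuous
                  long = runif(50),      # continuous
                  x1   = runif(50),      # continuous, meant to be linear
                  x2   = rep(0:1, 25))   # only two distinct values

f <- sketch_smoother_formula(toy)
print(f)
```

Under this rule a continuous predictor such as x1 is smoothed whether or not that is wanted, and lat and long can never end up inside the same s() term.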
As a result, you lose a significant amount of control over mgcv, as listed previously.
To verify this, let me construct an example similar to your case:

library(mgcv)  # provides gamSim()
set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]
dat$a <- runif(400)
dat$b <- runif(400)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)

#            y         x         z          a         b
#1 -0.30258559 0.8966972 0.1478457 0.07721866 0.3871130
#2 -0.59518832 0.2655087 0.6588776 0.13853856 0.8718050
#3 -0.06978648 0.3721239 0.1850700 0.04752457 0.9671970
#4 -0.17002059 0.5728534 0.9543781 0.03391887 0.8669163
#5  0.55452069 0.9082078 0.8978485 0.91608902 0.4377153
#6 -0.17763650 0.2016819 0.9436971 0.84020039 0.1919378
So we aim to fit the model y ~ s(x, z) + a + b. The response y is Gaussian, but this does not matter; it does not affect how caret works with mgcv.

library(caret)
cv <- train(y ~ x + z + a + b, data = dat, method = "gam", family = "gaussian",
            trControl = trainControl(method = "LOOCV", number=1, repeats=1),
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))
You can extract the final model:

fit <- cv$finalModel  # the fitted mgcv model, i.e. cv[[11]]

So what formula is it using?

fit$formula
#.outcome ~ s(x) + s(z) + s(a) + s(b)
See? Apart from being "additive, univariate", it also leaves every argument of mgcv::s at its default: bs = "tp", k = 10, etc.
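One possible workaround, which is not part of the answer above, is to skip caret and run the cross-validation by hand, so that the formula passed to mgcv::gam stays exactly y ~ s(x, z) + a + b. The sketch below assumes the same simulated data as in the example and uses 10-fold CV with RMSE as the score; the fold count and the metric are arbitrary choices.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2, verbose = FALSE)$data[1:3]
dat$a <- runif(400)
dat$b <- runif(400)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)

k_folds <- 10
# Randomly assign each row to one of the folds
fold <- sample(rep(seq_len(k_folds), length.out = nrow(dat)))
pred <- numeric(nrow(dat))

for (i in seq_len(k_folds)) {
  held_out <- fold == i
  # 2D smooth of (x, z); a and b enter linearly
  fit_i <- gam(y ~ s(x, z) + a + b, data = dat[!held_out, ],
               method = "GCV.Cp")
  pred[held_out] <- as.numeric(predict(fit_i, newdata = dat[held_out, ]))
}

# Out-of-sample root mean squared error
cv_rmse <- sqrt(mean((dat$y - pred)^2))
cv_rmse
```

For a binary outcome like the one in the question, the same loop should carry over with family = binomial in the gam call, predict(..., type = "response"), and a classification metric in place of RMSE. Setting k_folds to nrow(dat) gives leave-one-out CV at the cost of one fit per observation.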