Caret package - cross-validating GAM with both smooth and linear predictors


Problem description

I would like to cross-validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then linear predictors. Typical syntax when using mgcv is:

gam1 <- gam( y ~ s(lat , long) + x1 + x2, family = binomial(logit) )

I'm not quite sure how to specify this model using the train function in caret. This is my syntax more or less:

cv <- train(y ~ lat + long + x1 + x2, 
            data = data, 
            method = "gam", 
            family = "binomial", 
            trControl = trainControl(method = "LOOCV", number=1, repeats=), 
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

The problem is that I only want lat and long to be smoothed and x1 and x2 to be treated as linear.

Thanks!

Solution

It is very interesting to see someone using mgcv outside mgcv. After a bit of research, I am here to frustrate you: using mgcv with caret is a bad idea, at least with current support from caret.

Let me just ask you a few fundamental questions, if you are using caret:

  1. How can you specify the number of knots, as well as spline basis class for a smooth function?
  2. How can you specify 2D smooth function?
  3. How can you specify tensor product spline with te or ti?
  4. How can you tweak the smoothing parameters?
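For contrast, here is a minimal sketch of how each of these four points is expressed when calling mgcv directly. The basis types and `k` and `sp` values are arbitrary choices for illustration, not recommendations, and the data come from mgcv's own `gamSim` simulator.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]  # columns y, x, z

# 1. Basis class (bs) and basis dimension (k) are set per smooth term
m1 <- gam(y ~ s(x, bs = "cr", k = 20) + s(z, bs = "cr", k = 20), data = dat)

# 2. A 2D isotropic smooth: put both variables in a single s()
m2 <- gam(y ~ s(x, z, k = 40), data = dat)

# 3. A tensor product smooth with te()
m3 <- gam(y ~ te(x, z, k = c(6, 6)), data = dat)

# 4. Smoothing parameters can be fixed through the sp argument
m4 <- gam(y ~ s(x, z, k = 40), data = dat, sp = 0.01)
```

None of these options can be reached through the formula that caret builds for `method = "gam"`.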

If you want to know what caret::train is doing with method = "gam", check out its fitting routine:

getModelInfo(model = "gam", regex = FALSE)$gam$fit

function(x, y, wts, param, lev, last, classProbs, ...) { 
            dat <- if(is.data.frame(x)) x else as.data.frame(x)
            modForm <- caret:::smootherFormula(x)
            if(is.factor(y)) {
              dat$.outcome <- ifelse(y == lev[1], 0, 1)
              dist <- binomial()
            } else {
              dat$.outcome <- y
              dist <- gaussian()
            }
            modelArgs <- list(formula = modForm,
                              data = dat,
                              select = param$select, 
                              method = as.character(param$method))
            ## Intercept family if passed in
            theDots <- list(...)
            if(!any(names(theDots) == "family")) modelArgs$family <- dist
            modelArgs <- c(modelArgs, theDots)                 
            out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
            out    
            }
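One detail of the routine above matters for a binary outcome: a factor response is recoded with `ifelse(y == lev[1], 0, 1)`, so the first factor level becomes 0 and the fitted model describes the probability of the second level. A small base R illustration, using a made-up factor:

```r
# Hypothetical two-level outcome; "no" is the first level
y <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))
lev <- levels(y)

# Same recoding as in the fitting routine: first level -> 0, anything else -> 1
outcome <- ifelse(y == lev[1], 0, 1)
outcome
# 0 1 1 0
```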

You see the modForm <- caret:::smootherFormula(x) line? That line is the key, while the other lines are just routine construction of a model call. So, let's check what GAM formula caret is constructing:

caret:::smootherFormula

function (data, smoother = "s", cut = 10, df = 0, span = 0.5, 
    degree = 1, y = ".outcome") 
{
    nzv <- nearZeroVar(data)
    if (length(nzv) > 0) 
        data <- data[, -nzv, drop = FALSE]
    numValues <- sort(apply(data, 2, function(x) length(unique(x))))
    prefix <- rep("", ncol(data))
    suffix <- rep("", ncol(data))
    prefix[numValues > cut] <- paste(smoother, "(", sep = "")
    if (smoother == "s") {
        suffix[numValues > cut] <- if (df == 0) 
            ")"
        else paste(", df=", df, ")", sep = "")
    }
    if (smoother == "lo") {
        suffix[numValues > cut] <- paste(", span=", span, ",degree=", 
            degree, ")", sep = "")
    }
    if (smoother == "rcs") {
        suffix[numValues > cut] <- ")"
    }
    rhs <- paste(prefix, names(numValues), suffix, sep = "")
    rhs <- paste(rhs, collapse = "+")
    form <- as.formula(paste(y, rhs, sep = "~"))
    form
}

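In short, it creates additive, univariate smooths: every predictor with more than `cut = 10` distinct values is wrapped in its own s(), and every other predictor enters linearly. This is the classic form from when GAM was first proposed.

As a result, you lose a significant amount of control over mgcv, as listed previously.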
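The rule can be restated in a few lines of base R. This is a simplified sketch of the default `smoother = "s"`, `df = 0` branch only, not caret's actual code: the function name `sketch_formula` and the toy data are hypothetical, and the near-zero-variance filtering and sorting steps are left out.

```r
# A column is wrapped in s() only if it has more than `cut` distinct values
sketch_formula <- function(data, cut = 10, y = ".outcome") {
  n_unique <- vapply(data, function(col) length(unique(col)), integer(1))
  terms <- ifelse(n_unique > cut,
                  paste0("s(", names(n_unique), ")"),
                  names(n_unique))
  as.formula(paste(y, paste(terms, collapse = " + "), sep = " ~ "))
}

set.seed(1)
toy <- data.frame(lat = runif(50), long = runif(50),
                  x1 = runif(50),            # continuous, so it gets smoothed
                  x2 = rbinom(50, 1, 0.5))   # binary, so it stays linear
sketch_formula(toy)
# .outcome ~ s(lat) + s(long) + s(x1) + x2
```

Note that a continuous covariate such as `x1` is smoothed whether you want it or not, and there is no way to pair `lat` and `long` in one term.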

To verify this, let me construct a similar example to your case:

library(mgcv)   # provides gamSim() and gam()
library(caret)  # provides train()

set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]
dat$a <- runif(400)
dat$b <- runif(400)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)
head(dat)

#            y         x         z          a         b
#1 -0.30258559 0.8966972 0.1478457 0.07721866 0.3871130
#2 -0.59518832 0.2655087 0.6588776 0.13853856 0.8718050
#3 -0.06978648 0.3721239 0.1850700 0.04752457 0.9671970
#4 -0.17002059 0.5728534 0.9543781 0.03391887 0.8669163
#5  0.55452069 0.9082078 0.8978485 0.91608902 0.4377153
#6 -0.17763650 0.2016819 0.9436971 0.84020039 0.1919378

So we aim to fit a model: y ~ s(x, z) + a + b. The data y is Gaussian, but this does not matter; it does not affect how caret works with mgcv.

cv <- train(y ~ x + z + a + b, data = dat, method = "gam", family = "gaussian",
            trControl = trainControl(method = "LOOCV", number=1, repeats=1), 
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

You can extract the final model:

fit <- cv$finalModel  # same object as cv[[11]], but does not depend on element order

So what formula is it using?

fit$formula
#.outcome ~ s(x) + s(z) + s(a) + s(b)

See? Apart from being "additive, univariate", it also leaves everything of mgcv::s to its default: default bs = "tp", default k = 10, etc.
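If the goal is simply a cross-validated error for the intended model, one option is to write the resampling loop by hand around mgcv itself. The following is a minimal sketch, assuming 10-fold cross-validation and RMSE as the metric (both are illustrative choices, and the variable names are made up); it reuses the simulated data from above so that it runs on its own.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]
dat$a <- runif(400)
dat$b <- runif(400)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)

k_folds <- 10
# Randomly assign each row to one of the folds
fold_id <- sample(rep(seq_len(k_folds), length.out = nrow(dat)))
sq_err <- numeric(nrow(dat))

for (k in seq_len(k_folds)) {
  held_out <- fold_id == k
  # The formula is written by hand, so s(x, z) + a + b is used exactly as intended
  fit_k <- gam(y ~ s(x, z) + a + b, data = dat[!held_out, ], method = "GCV.Cp")
  pred_k <- predict(fit_k, newdata = dat[held_out, ])
  sq_err[held_out] <- (dat$y[held_out] - pred_k)^2
}

cv_rmse <- sqrt(mean(sq_err))
cv_rmse
```

For a binary outcome like the one in the question, the same loop should carry over with `family = binomial`, `predict(..., type = "response")`, and a classification metric in place of RMSE.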
