使用glm拟合逻辑回归的默认起始值 [英] Default starting values fitting logistic regression with glm

查看:65
本文介绍了使用glm拟合逻辑回归的默认起始值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道如何在 glm 中指定默认起始值​​.

I'm wondering how are default starting values specified in glm.

帖子建议默认值为设置为零.一个表示其背后有一个算法,但是相关链接已断开

This post suggests that default values are set as zeros. This one says that there is an algorithm behind it, however relevant link is broken.

我试图用算法跟踪拟合简单的逻辑回归模型:

I tried to fit simple logistic regression model with algorithm trace:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

# to see parameter estimates in each step
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))

首先,不指定初始值:

glm(y ~ x, family = "binomial")

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

第一步,初始值为 NULL .

第二,我将起始值设置为零:

Second, I set starting values to be zeros:

glm(y ~ x, family = "binomial", start = c(0, 0))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0 0
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3177530 0.9097521
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3909975 1.1397163
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3994147 1.1666173
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995191 1.1669518

我们可以看到第一和第二种方法之间的迭代是不同的.

And we can see that iterations between first and second approach differ.

要查看由 glm 指定的初始值,我尝试通过一次迭代来拟合模型:

To see initial values specified by glm I tried to fit model with only one iteration:

glm(y ~ x, family = "binomial", control = list(maxit = 1))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL

Call:  glm(formula = y ~ x, family = "binomial", control = list(maxit = 1))

Coefficients:
(Intercept)            x  
     0.3864       1.1062  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      134.6 
Residual Deviance: 115  AIC: 119

参数估计(并不奇怪)对应于第二次迭代中第一种方法的估计,即 [1] 0.386379 1.106234 将这些值设置为初始值会导致与第一种方法相同的迭代序列:

Estimates of parameters (not surprisingly) correspond to estimates of the first approach in the second iteration i.e., [1] 0.386379 1.106234 Setting these values as initial values leads to the same iterations sequence as in the first approach:

glm(y ~ x, family = "binomial", start = c(0.386379, 1.106234))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

问题是,这些值是如何计算的?

So the question is, how these values are calculated?

推荐答案

TL; DR

  • start = c(b0,b1)将eta初始化为 b0 + x * b1 (mu等于1/(1 + exp(-eta)))

    TL;DR

    • start=c(b0,b1) initializes eta to b0+x*b1 (mu to 1/(1+exp(-eta)))

      start = c(0,0)会将eta初始化为0(μ到0.5),而与y或x值无关.

      start=c(0,0) initializes eta to 0 (mu to 0.5) regardless of y or x value.

      start = NULL 如果y = 1,则初始化eta = 1.098612(mu = 0.75),无论x值如何.

      start=NULL initializes eta= 1.098612 (mu=0.75) if y=1, regardless of x value.

      start = NULL 如果y = 0,则初始化eta = -1.098612(mu = 0.25),无论x值如何.

      start=NULL initializes eta=-1.098612 (mu=0.25) if y=0, regardless of x value.

      一旦计算出eta(因此,mu和var(mu)),就计算出 w z 并将其发送到QR解算器中, qr.solve(cbind(1,x)* w,z * w)的精神.

      Once eta (and consequently mu and var(mu)) has been calculated, w and z are calculated and sent to a QR solver, in the spirit of qr.solve(cbind(1,x) * w, z*w).

      以罗兰(Roland)的评论为依据:我做了一个 glm.fit.truncated(),在其中我将 glm.fit 降到了 C_Cdqrls 致电,然后将其注释掉. glm.fit.truncated 输出 z w 值(以及用于计算 z 的数量的值 w ),然后将其传递给 C_Cdqrls 调用:

      Building off Roland's comment: I made a glm.fit.truncated(), where I took glm.fit down to the C_Cdqrls call, and then commented it out. glm.fit.truncated outputs the z and w values (as well as the values of the quantities used to calculate z and w) which would then be passed to the C_Cdqrls call:

      ## call Fortran code via C wrapper
      fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
                   min(1e-7, control$epsilon/1000), check=FALSE) 
      

      可以阅读有关 C_Cdqrls 的更多信息此处.幸运的是,基数R中的函数 qr.solve 直接进入 glm.fit()中被调用的LINPACK版本.

      More can be read about C_Cdqrls here. Luckily, the function qr.solve in base R taps directly into the LINPACK versions being called upon in glm.fit().

      因此,针对不同的起始值规范运行 glm.fit.truncated ,然后使用w和z值调用 qr.solve ,然后我们看看如何开始值"(或第一个显示的迭代值)进行计算.正如Roland所指出的,在glm()中指定 start = NULL start = c(0,0)会影响w和z的计算,而不会影响 表示 start .

      So we run glm.fit.truncated for the different starting value specifications, and then do a call to qr.solve with the w and z values, and we see how the "starting values" (or the first displayed iteration values) are calculated. As Roland indicated, specifying start=NULL or start=c(0,0) in glm() affects the calculations for w and z, not for start.

      对于start = NULL: z 是一个向量,其中元素的值为2.431946或-2.431946,而 w 是一个向量,其中所有元素的值为0.4330127:

      For the start=NULL: z is a vector where the elements have the value 2.431946 or -2.431946 and w is a vector where all elements are 0.4330127:

      start.is.null <- glm.fit.truncated(x,y,family=binomial(), start=NULL)
      start.is.null
      w <- start.is.null$w
      z <- start.is.null$z
      ## if start is NULL, the first displayed values are:
      qr.solve(cbind(1,x) * w, z*w)  
      # > qr.solve(cbind(1,x) * w, z*w)  
      #                 x 
      # 0.386379 1.106234 
      

      对于start = c(0,0): z 是一个向量,其中元素的值为2或-2,而 w 是一个向量,其中所有元素的值是0.5:

      For the start=c(0,0): z is a vector where the elements have the value 2 or -2 and w is a vector where all elements are 0.5:

      ## if start is c(0,0)    
      start.is.00 <- glm.fit.truncated(x,y,family=binomial(), start=0)
      start.is.00
      w <- start.is.00$w
      z <- start.is.00$z
      ## if start is c(0,0), the first displayed values are:    
      qr.solve(cbind(1,x) * w, z*w)  
      # > qr.solve(cbind(1,x) * w, z*w)  
      #                   x 
      # 0.3177530 0.9097521 
      

      这很好,但是我们如何计算 w z ?在 glm.fit.truncated()底部附近,我们看到了

      So that's all well and good, but how do we calculate the w and z? Near the bottom of glm.fit.truncated() we see

      z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
      w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
      

      请查看用于计算 z w 的数量的输出值之间的以下比较:

      Look at the following comparisons between the outputted values of the quantities used to calculate z and w:

      cbind(y, start.is.null$mu, start.is.00$mu)
      cbind(y, start.is.null$eta, start.is.00$eta)
      cbind(start.is.null$var_mu, start.is.00$var_mu)
      cbind(start.is.null$mu.eta.val, start.is.00$mu.eta.val)
      

      请注意, start.is.00 的矢量 mu 的值仅为0.5,因为eta设置为0且mu(eta)= 1/(1+exp(-0))= 0.5. start.is.null 将y = 1的那些设置为mu = 0.75(对应于eta = 1.098612),将y = 0的那些设置为mu = 0.25(对应于eta = -1.098612)),因此 var_mu = 0.75 * 0.25 = 0.1875.

      Note that start.is.00 will have vector mu with only the values 0.5 because eta is set to 0 and mu(eta) = 1/(1+exp(-0))= 0.5. start.is.null sets those with y=1 to be mu=0.75 (which corresponds to eta=1.098612) and those with y=0 to be mu=0.25 (which corresponds to eta=-1.098612), and thus the var_mu = 0.75*0.25 = 0.1875.

      但是,值得注意的是,我更改了种子并重新运行了所有内容,y = 1的mu = 0.75,y = 0的mu = 0.25(因此其他数量保持不变).也就是说,start = NULL引起相同的 w z ,而不管什么 y x 是,因为如果y = 1则初始化eta = 1.098612(mu = 0.75),如果y = 0则初始化eta = -1.098612(mu = 0.25).

      However, it is interesting to note, that I changed the seed and reran everything and the mu=0.75 for y=1 and mu=0.25 for y=0 (and thus the other quantities stayed the same). That is to say, start=NULL gives rise to the same w and z regardless of what y and x are, because they initialize eta=1.098612 (mu=0.75) if y=1 and eta=-1.098612 (mu=0.25) if y=0.

      因此,似乎没有为start = NULL设置Intercept系数和X系数的起始值,而是根据y值并且独立于x值将eta赋给了初始值.从那里计算出 w z ,然后与 x 一起发送到qr.solver.

      So it appears that a starting value for the Intercept coefficient and for the X-coefficient is not set for start=NULL, but rather initial values are given to eta depending on the y-value and independent of the x-value. From there w and z are calculated, then sent along with x to the qr.solver.

      set.seed(123)
      
      x <- rnorm(100)
      p <- 1/(1 + exp(-x))
      y <- rbinom(100, size = 1, prob = p)
      
      
      glm.fit.truncated <- function(x, y, weights = rep.int(1, nobs), 
      start = 0,etastart = NULL, mustart = NULL, 
      offset = rep.int(0, nobs),
      family = binomial(), 
      control = list(), 
      intercept = TRUE,
      singular.ok = TRUE
      ){
      control <- do.call("glm.control", control)
      x <- as.matrix(x)
      xnames <- dimnames(x)[[2L]]
      ynames <- if(is.matrix(y)) rownames(y) else names(y)
      conv <- FALSE
      nobs <- NROW(y)
      nvars <- ncol(x)
      EMPTY <- nvars == 0
      ## define weights and offset if needed
      if (is.null(weights))
        weights <- rep.int(1, nobs)
      if (is.null(offset))
        offset <- rep.int(0, nobs)
      
      ## get family functions:
      variance <- family$variance
      linkinv  <- family$linkinv
      if (!is.function(variance) || !is.function(linkinv) )
        stop("'family' argument seems not to be a valid family object", call. = FALSE)
      dev.resids <- family$dev.resids
      aic <- family$aic
      mu.eta <- family$mu.eta
      unless.null <- function(x, if.null) if(is.null(x)) if.null else x
      valideta <- unless.null(family$valideta, function(eta) TRUE)
      validmu  <- unless.null(family$validmu,  function(mu) TRUE)
      if(is.null(mustart)) {
        ## calculates mustart and may change y and weights and set n (!)
        eval(family$initialize)
      } else {
        mukeep <- mustart
        eval(family$initialize)
        mustart <- mukeep
      }
      if(EMPTY) {
        eta <- rep.int(0, nobs) + offset
        if (!valideta(eta))
          stop("invalid linear predictor values in empty model", call. = FALSE)
        mu <- linkinv(eta)
        ## calculate initial deviance and coefficient
        if (!validmu(mu))
          stop("invalid fitted means in empty model", call. = FALSE)
        dev <- sum(dev.resids(y, mu, weights))
        w <- sqrt((weights * mu.eta(eta)^2)/variance(mu))
        residuals <- (y - mu)/mu.eta(eta)
        good <- rep_len(TRUE, length(residuals))
        boundary <- conv <- TRUE
        coef <- numeric()
        iter <- 0L
      } else {
        coefold <- NULL
        eta <-
          if(!is.null(etastart)) etastart
        else if(!is.null(start))
          if (length(start) != nvars)
            stop(gettextf("length of 'start' should equal %d and correspond to initial coefs for %s", nvars, paste(deparse(xnames), collapse=", ")),
                 domain = NA)
        else {
          coefold <- start
          offset + as.vector(if (NCOL(x) == 1L) x * start else x %*% start)
        }
        else family$linkfun(mustart)
        mu <- linkinv(eta)
        if (!(validmu(mu) && valideta(eta)))
          stop("cannot find valid starting values: please specify some", call. = FALSE)
        ## calculate initial deviance and coefficient
        devold <- sum(dev.resids(y, mu, weights))
        boundary <- conv <- FALSE
        
        ##------------- THE Iteratively Reweighting L.S. iteration -----------
        for (iter in 1L:control$maxit) {
          good <- weights > 0
          varmu <- variance(mu)[good]
          if (anyNA(varmu))
            stop("NAs in V(mu)")
          if (any(varmu == 0))
            stop("0s in V(mu)")
          mu.eta.val <- mu.eta(eta)
          if (any(is.na(mu.eta.val[good])))
            stop("NAs in d(mu)/d(eta)")
          ## drop observations for which w will be zero
          good <- (weights > 0) & (mu.eta.val != 0)
          
          if (all(!good)) {
            conv <- FALSE
            warning(gettextf("no observations informative at iteration %d",
                             iter), domain = NA)
            break
          }
          z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
          w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
          # ## call Fortran code via C wrapper
          # fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
          #              min(1e-7, control$epsilon/1000), check=FALSE)
          # 
          
          #print(iter)
          #print(z)
          #print(w)
        }
      
        
        }
        return(list(z=z, w=w, mustart=mustart, etastart=etastart, eta=eta, offset=offset, mu=mu, mu.eta.val=mu.eta.val,
                    weight=weights, var_mu=variance(mu)))
      
      }
      

      这篇关于使用glm拟合逻辑回归的默认起始值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆