Default starting values fitting logistic regression with glm
Question
I'm wondering how default starting values are specified in glm.
This post suggests that the default values are set to zeros. This one says that there is an algorithm behind it, but the relevant link is broken.
I tried to fit a simple logistic regression model with an algorithm trace:
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)
# to see parameter estimates in each step
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))
First, without specifying the initial values:
glm(y ~ x, family = "binomial")
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
NULL
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3995188 1.1669508
At the first step, the initial values are NULL.
Second, I set the starting values to zeros:
glm(y ~ x, family = "binomial", start = c(0, 0))
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0 0
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3177530 0.9097521
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3909975 1.1397163
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3994147 1.1666173
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3995191 1.1669518
We can see that the iterations differ between the first and second approaches.
To see the initial values specified by glm, I tried to fit the model with only one iteration:
glm(y ~ x, family = "binomial", control = list(maxit = 1))
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
NULL
Call: glm(formula = y ~ x, family = "binomial", control = list(maxit = 1))
Coefficients:
(Intercept)            x
     0.3864       1.1062
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 134.6
Residual Deviance: 115 AIC: 119
The parameter estimates (not surprisingly) correspond to the estimates of the first approach in the second iteration, i.e., [1] 0.386379 1.106234
Setting these values as initial values leads to the same iteration sequence as in the first approach:
glm(y ~ x, family = "binomial", start = c(0.386379, 1.106234))
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] 0.3995188 1.1669508
So the question is: how are these values calculated?
Recommended answer
TL;DR
- start=c(b0,b1) initializes eta to b0+x*b1 (mu to 1/(1+exp(-eta)))
- start=c(0,0) initializes eta to 0 (mu to 0.5) regardless of y or x value.
- start=NULL initializes eta=1.098612 (mu=0.75) if y=1, regardless of x value.
- start=NULL initializes eta=-1.098612 (mu=0.25) if y=0, regardless of x value.
- Once eta (and consequently mu and var(mu)) has been calculated, w and z are calculated and sent to a QR solver, in the spirit of qr.solve(cbind(1,x) * w, z*w)
Building off Roland's comment: I made a glm.fit.truncated(), where I took glm.fit down to the C_Cdqrls call and then commented that call out. glm.fit.truncated outputs the z and w values (as well as the values of the quantities used to calculate z and w) which would then be passed to the C_Cdqrls call:
## call Fortran code via C wrapper
fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
             min(1e-7, control$epsilon/1000), check=FALSE)
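C_Cdqrls is internal to the stats package, so it cannot be called directly from user code. As a rough illustration of the kind of problem that call solves, the sketch below checks that qr.solve() on the row-scaled system gives the same coefficients as a weighted least-squares fit with weights w^2. The names x_demo, X_demo, w_demo and z_demo are made-up illustration data, not the quantities computed inside glm.fit.

```r
## Illustration only: arbitrary data, not the glm.fit quantities.
set.seed(1)
n <- 50
x_demo <- rnorm(n)
X_demo <- cbind(1, x_demo)          # design matrix with intercept column
w_demo <- runif(n, 0.2, 1)          # hypothetical square-root weights
z_demo <- rnorm(n)                  # hypothetical working response

## Least squares on the row-scaled system, as in qr.solve(cbind(1,x) * w, z*w)
b_qr <- qr.solve(X_demo * w_demo, z_demo * w_demo)

## The same fit expressed as weighted least squares with weights w^2
b_wfit <- lm.wfit(X_demo, z_demo, w = w_demo^2)$coefficients
b_lm   <- coef(lm(z_demo ~ x_demo, weights = w_demo^2))

print(rbind(qr.solve = unname(b_qr),
            lm.wfit  = unname(b_wfit),
            lm       = unname(b_lm)))
```

All three rows should agree up to floating-point error, which is why the w and z values are enough to reproduce a single IRLS step by hand.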
More can be read about C_Cdqrls here. Luckily, the function qr.solve in base R taps directly into the LINPACK versions being called upon in glm.fit().
So we run glm.fit.truncated for the different starting value specifications, then call qr.solve with the w and z values, and we see how the "starting values" (or the first displayed iteration values) are calculated. As Roland indicated, specifying start=NULL or start=c(0,0) in glm() affects the calculations for w and z, not for start.
For start=NULL: z is a vector whose elements have the value 2.431946 or -2.431946, and w is a vector whose elements are all 0.4330127:
start.is.null <- glm.fit.truncated(x, y, family = binomial(), start = NULL)
start.is.null
w <- start.is.null$w
z <- start.is.null$z
## if start is NULL, the first displayed values are:
qr.solve(cbind(1, x) * w, z * w)
# > qr.solve(cbind(1,x) * w, z*w)
#                  x
# 0.386379 1.106234
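The same first step can be sketched without glm.fit.truncated(). This assumes that, for a 0/1 response with unit prior weights, the binomial family initializes mu as (weights * y + 0.5)/(weights + 1), which is consistent with the 0.75 and 0.25 values reported in this answer. The names fam, wts, mu0, eta0, z0, w0, first_null and fit1 are introduced here for illustration.

```r
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

fam <- binomial()
wts <- rep(1, length(y))

## Assumed initialization for start = NULL: depends on y only, not on x
mu0  <- (wts * y + 0.5) / (wts + 1)      # 0.75 if y == 1, 0.25 if y == 0
eta0 <- fam$linkfun(mu0)                 # logit(mu0)

## Working response and square-root weights, as in glm.fit
mu.eta.val <- fam$mu.eta(eta0)
z0 <- eta0 + (y - mu0) / mu.eta.val
w0 <- sqrt(wts * mu.eta.val^2 / fam$variance(mu0))

## First displayed coefficients when start is NULL
first_null <- qr.solve(cbind(1, x) * w0, z0 * w0)

## Compare with glm() stopped after one iteration (non-convergence warning suppressed)
fit1 <- suppressWarnings(
  glm(y ~ x, family = binomial(), control = list(maxit = 1))
)
print(rbind(by_hand = unname(first_null), glm_maxit1 = unname(coef(fit1))))
```

If the assumption about the initialization holds, the two rows should match the first non-NULL values in the trace from the question.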
For start=c(0,0): z is a vector whose elements have the value 2 or -2, and w is a vector whose elements are all 0.5:
## if start is c(0,0)
## (start = 0 has length 1 here because x is passed as a single column
##  without an intercept column; eta is 0 either way)
start.is.00 <- glm.fit.truncated(x, y, family = binomial(), start = 0)
start.is.00
w <- start.is.00$w
z <- start.is.00$z
## if start is c(0,0), the first displayed values are:
qr.solve(cbind(1, x) * w, z * w)
# > qr.solve(cbind(1,x) * w, z*w)
#                    x
# 0.3177530 0.9097521
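Because w is constant when eta = 0, the first step from start=c(0,0) is just an ordinary least-squares regression of the working response on x. The sketch below checks this under the assumptions of this example (logit link, unit prior weights, no offset). The names z00, w00, first_00, ols_00 and fit00 are introduced here for illustration.

```r
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

## eta = 0 gives mu = 0.5 and d(mu)/d(eta) = 0.25, so
## z = 0 + (y - 0.5)/0.25 = 4*y - 2 and w = sqrt(0.25^2 / 0.25) = 0.5
z00 <- 4 * y - 2
w00 <- rep(0.5, length(y))

## Weighted solve, as in the answer
first_00 <- qr.solve(cbind(1, x) * w00, z00 * w00)

## Constant weights drop out: plain OLS of z on x gives the same coefficients
ols_00 <- coef(lm(z00 ~ x))

## glm() started at zero and stopped after one iteration
fit00 <- suppressWarnings(
  glm(y ~ x, family = binomial(), start = c(0, 0), control = list(maxit = 1))
)
print(rbind(qr.solve   = unname(first_00),
            ols        = unname(ols_00),
            glm_maxit1 = unname(coef(fit00))))
```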
So that's all well and good, but how do we calculate w and z? Near the bottom of glm.fit.truncated() we see
z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
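For the logit link, these two lines reduce to simple closed forms, which makes the reported numbers easy to check by hand. The following is a worked sketch, assuming unit prior weights and zero offset; the symbol \omega_i for the prior weights is notation introduced here, not taken from the code.

```latex
\begin{aligned}
z_i &= \eta_i + \frac{y_i - \mu_i}{d\mu_i/d\eta_i}, \qquad
w_i = \sqrt{\frac{\omega_i\,(d\mu_i/d\eta_i)^2}{V(\mu_i)}} \\
\text{logit:}\quad & \frac{d\mu_i}{d\eta_i} = \mu_i(1-\mu_i) = V(\mu_i)
  \;\Rightarrow\; w_i = \sqrt{\mu_i(1-\mu_i)} \quad (\omega_i = 1) \\
\mu_i = 0.75,\ y_i = 1:\quad & w_i = \sqrt{0.1875} \approx 0.4330127, \quad
  z_i = 1.098612 + \frac{0.25}{0.1875} \approx 2.431946 \\
\mu_i = 0.5,\ y_i = 1:\quad & w_i = \sqrt{0.25} = 0.5, \quad
  z_i = 0 + \frac{0.5}{0.25} = 2
\end{aligned}
```

For y_i = 0 the weights are unchanged and z_i flips sign, which gives the values -2.431946 and -2.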
Look at the following comparisons between the output values of the quantities used to calculate z and w:
cbind(y, start.is.null$mu, start.is.00$mu)
cbind(y, start.is.null$eta, start.is.00$eta)
cbind(start.is.null$var_mu, start.is.00$var_mu)
cbind(start.is.null$mu.eta.val, start.is.00$mu.eta.val)
Note that start.is.00 has a vector mu containing only the value 0.5, because eta is set to 0 and mu(eta) = 1/(1+exp(-0)) = 0.5. start.is.null sets those with y=1 to mu=0.75 (which corresponds to eta=1.098612) and those with y=0 to mu=0.25 (which corresponds to eta=-1.098612), and thus var_mu = 0.75*0.25 = 0.1875.
However, it is interesting to note that when I changed the seed and reran everything, mu was still 0.75 for y=1 and 0.25 for y=0 (and thus the other quantities stayed the same). That is to say, start=NULL gives rise to the same w and z values for a given y, regardless of x, because eta is initialized to 1.098612 (mu=0.75) if y=1 and to -1.098612 (mu=0.25) if y=0.
So it appears that for start=NULL no starting value is set for the intercept coefficient or for the x coefficient; rather, initial values are given to eta depending on the y value and independent of the x value. From there w and z are calculated, then sent along with x
to the QR solver.
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

glm.fit.truncated <- function(x, y, weights = rep.int(1, nobs),
                              start = 0, etastart = NULL, mustart = NULL,
                              offset = rep.int(0, nobs), family = binomial(),
                              control = list(), intercept = TRUE,
                              singular.ok = TRUE) {
  control <- do.call("glm.control", control)
  x <- as.matrix(x)
  xnames <- dimnames(x)[[2L]]
  ynames <- if(is.matrix(y)) rownames(y) else names(y)
  conv <- FALSE
  nobs <- NROW(y)
  nvars <- ncol(x)
  EMPTY <- nvars == 0
  ## define weights and offset if needed
  if (is.null(weights)) weights <- rep.int(1, nobs)
  if (is.null(offset)) offset <- rep.int(0, nobs)
  ## get family functions:
  variance <- family$variance
  linkinv <- family$linkinv
  if (!is.function(variance) || !is.function(linkinv))
    stop("'family' argument seems not to be a valid family object", call. = FALSE)
  dev.resids <- family$dev.resids
  aic <- family$aic
  mu.eta <- family$mu.eta
  unless.null <- function(x, if.null) if(is.null(x)) if.null else x
  valideta <- unless.null(family$valideta, function(eta) TRUE)
  validmu <- unless.null(family$validmu, function(mu) TRUE)
  if(is.null(mustart)) {
    ## calculates mustart and may change y and weights and set n (!)
    eval(family$initialize)
  } else {
    mukeep <- mustart
    eval(family$initialize)
    mustart <- mukeep
  }
  if(EMPTY) {
    eta <- rep.int(0, nobs) + offset
    if (!valideta(eta))
      stop("invalid linear predictor values in empty model", call. = FALSE)
    mu <- linkinv(eta)
    ## calculate initial deviance and coefficient
    if (!validmu(mu))
      stop("invalid fitted means in empty model", call. = FALSE)
    dev <- sum(dev.resids(y, mu, weights))
    w <- sqrt((weights * mu.eta(eta)^2)/variance(mu))
    residuals <- (y - mu)/mu.eta(eta)
    good <- rep_len(TRUE, length(residuals))
    boundary <- conv <- TRUE
    coef <- numeric()
    iter <- 0L
  } else {
    coefold <- NULL
    eta <-
      if(!is.null(etastart)) etastart
      else if(!is.null(start))
        if (length(start) != nvars)
          stop(gettextf("length of 'start' should equal %d and correspond to initial coefs for %s",
                        nvars, paste(deparse(xnames), collapse=", ")),
               domain = NA)
        else {
          coefold <- start
          offset + as.vector(if (NCOL(x) == 1L) x * start else x %*% start)
        }
      else family$linkfun(mustart)
    mu <- linkinv(eta)
    if (!(validmu(mu) && valideta(eta)))
      stop("cannot find valid starting values: please specify some", call. = FALSE)
    ## calculate initial deviance and coefficient
    devold <- sum(dev.resids(y, mu, weights))
    boundary <- conv <- FALSE

    ##------------- THE Iteratively Reweighting L.S. iteration -----------
    for (iter in 1L:control$maxit) {
      good <- weights > 0
      varmu <- variance(mu)[good]
      if (anyNA(varmu))
        stop("NAs in V(mu)")
      if (any(varmu == 0))
        stop("0s in V(mu)")
      mu.eta.val <- mu.eta(eta)
      if (any(is.na(mu.eta.val[good])))
        stop("NAs in d(mu)/d(eta)")
      ## drop observations for which w will be zero
      good <- (weights > 0) & (mu.eta.val != 0)

      if (all(!good)) {
        conv <- FALSE
        warning(gettextf("no observations informative at iteration %d", iter),
                domain = NA)
        break
      }
      z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
      w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
      # ## call Fortran code via C wrapper
      # fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
      #              min(1e-7, control$epsilon/1000), check=FALSE)
      #
      #print(iter)
      #print(z)
      #print(w)
    }
  }
  return(list(z=z, w=w, mustart=mustart, etastart=etastart, eta=eta,
              offset=offset, mu=mu, mu.eta.val=mu.eta.val, weight=weights,
              var_mu=variance(mu)))
}
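To tie the pieces together, here is a minimal sketch of the full iteration for this specific case (logit link, 0/1 response, unit prior weights, no offset). irls_logit() is a hypothetical helper written for this illustration, not a function from stats, and it omits glm.fit's step-halving, boundary checks and rank handling. The start=NULL branch again assumes the (weights * y + 0.5)/(weights + 1) initialization.

```r
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

## Hypothetical helper: simplified IRLS for logistic regression
irls_logit <- function(X, y, start = NULL, maxit = 25, epsilon = 1e-8) {
  wts <- rep(1, length(y))
  ## start = NULL: eta comes from y only; otherwise eta = X %*% start
  eta <- if (is.null(start)) qlogis((wts * y + 0.5) / (wts + 1))
         else drop(X %*% start)
  history <- list()
  devold <- Inf
  for (iter in seq_len(maxit)) {
    mu <- plogis(eta)
    mu.eta.val <- mu * (1 - mu)                      # d(mu)/d(eta) for logit
    z <- eta + (y - mu) / mu.eta.val                 # working response
    w <- sqrt(wts * mu.eta.val^2 / (mu * (1 - mu)))  # sqrt of working weights
    beta <- qr.solve(X * w, z * w)                   # weighted least squares
    history[[iter]] <- beta
    eta <- drop(X %*% beta)
    mu_new <- plogis(eta)
    ## binomial deviance for a 0/1 response
    dev <- -2 * sum(y * log(mu_new) + (1 - y) * log(1 - mu_new))
    ## same form of relative-change criterion as glm.control's epsilon
    if (abs(dev - devold) / (abs(dev) + 0.1) < epsilon) break
    devold <- dev
  }
  list(coefficients = beta, iterations = iter,
       history = do.call(rbind, history))
}

X <- cbind(1, x)
fit_null <- irls_logit(X, y)                    # like glm(y ~ x, binomial)
fit_zero <- irls_logit(X, y, start = c(0, 0))   # like start = c(0, 0)

print(fit_null$history)   # coefficients after each iteration, start = NULL
print(fit_zero$history)   # coefficients after each iteration, start = c(0, 0)
```

If the assumptions hold, the rows of each history matrix should follow the traced coefold values from the question, and both runs should end at essentially the same estimates as glm(y ~ x, family = "binomial").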