Does Quasi Separation matter in R binomial GLM?
Question
I am learning how quasi-separation affects a binomial GLM in R, and I am starting to think that it does not matter in some circumstances.
As I understand it, data exhibit quasi-separation when some linear combination of factor levels can completely identify failure/non-failure.
So I created an artificial dataset with quasi-separation in R:
fail <- c(100,100,100,100)
nofail <- c(100,100,0,100)
x1 <- c(1,0,1,0)
x2 <- c(0,0,1,1)
data <- data.frame(fail,nofail,x1,x2)
rownames(data) <- paste("obs",1:4)
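The separation is easy to see from the raw counts. A minimal check of my own (not part of the original post): the empirical failure proportion is exactly 1 in obs 3 and 0.5 everywhere else, which is what creates the quasi-separation.

```r
# Empirical failure proportion per row of the artificial data.
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
prop <- fail / (fail + nofail)
print(data.frame(x1, x2, prop))  # obs 3 (x1 = 1, x2 = 1) has prop = 1
```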
Then, when x1=1 and x2=1 (obs 3), the data always fail (nofail is 0 there). In this data, my covariate matrix has three columns: intercept, x1 and x2.
As I understand it, quasi-separation leads to infinite parameter estimates, so the glm fit should fail. However, the following glm fit does NOT fail:
summary(glm(cbind(fail,nofail)~x1+x2,data=data,family=binomial))
The result is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4342 0.1318 -3.294 0.000986 ***
x1 0.8684 0.1660 5.231 1.69e-07 ***
x2 0.8684 0.1660 5.231 1.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Std. Error seems very reasonable even with the quasi-separation. Could anyone tell me why the quasi-separation is NOT affecting the glm fit result?
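One way to see why the additive fit is untroubled (a check of my own, assuming the data frame constructed above): none of the fitted probabilities from ~ x1 + x2 is pushed to 0 or 1, so no coefficient needs to run off to infinity.

```r
# Rebuild the data and refit the additive model from the question.
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
data <- data.frame(fail, nofail, x1, x2)

fit_add <- glm(cbind(fail, nofail) ~ x1 + x2, data = data, family = binomial)
print(fitted(fit_add))  # all four fitted probabilities lie strictly inside (0, 1)
```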
Answer
You have constructed an interesting example, but you are not testing a model that actually examines the situation that you describe as quasi-separation. When you say "when x1=1 and x2=1 (obs 3) the data always fails", you are implying the need for an interaction term in the model. Notice that this produces a "more interesting" result:
> summary(glm(cbind(fail,nofail)~x1*x2,data=data,family=binomial))
Call:
glm(formula = cbind(fail, nofail) ~ x1 * x2, family = binomial,
data = data)
Deviance Residuals:
[1] 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.367e-17 1.414e-01 0.000 1
x1 2.675e-17 2.000e-01 0.000 1
x2 2.965e-17 2.000e-01 0.000 1
x1:x2 2.731e+01 5.169e+04 0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.2429e+02 on 3 degrees of freedom
Residual deviance: 2.7538e-10 on 0 degrees of freedom
AIC: 25.257
Number of Fisher Scoring iterations: 22
One generally needs to be very suspicious of a beta coefficient of 2.731e+01: the implied odds ratio is:
> exp(2.731e+01)
[1] 725407933166
In this working environment there really is no material difference between Inf and 725,407,933,166.
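Because the Wald standard error explodes under separation (hence z ≈ 0 and p ≈ 1 for x1:x2 despite a perfectly separated cell), a likelihood-ratio test is a more reliable way to assess the interaction. A sketch of my own, not part of the original answer:

```r
# Rebuild the data, then compare the additive and interaction models
# with a likelihood-ratio (deviance) test instead of the Wald test.
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
data <- data.frame(fail, nofail, x1, x2)

fit0 <- glm(cbind(fail, nofail) ~ x1 + x2, data = data, family = binomial)
fit1 <- glm(cbind(fail, nofail) ~ x1 * x2, data = data, family = binomial)
print(anova(fit0, fit1, test = "Chisq"))  # the interaction is highly significant
```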