How to specify covariates in a regression model
Question
The dataset I would like to analyse looks like this:
n <- 4000
tmp <- t(replicate(n, sample(49,6)))
dat <- matrix(0, nrow=n, ncol=49)
colnames(dat) <- paste("p", 1:49, sep="")
dat <- as.data.frame(dat)
dat[, "win.frac"] <- rnorm(n, mean=0.0176504, sd=0.002)
for (i in 1:nrow(dat))
for (j in 1:6) dat[i, paste("p", tmp[i, j], sep="")] <- 1
str(dat)
Now I would like to perform a regression with win.frac as the dependent variable and all other variables (p1, ..., p49) as explanatory variables.
However, with every approach I tried, the coefficient for p49 comes out as NA, with the message "1 not defined because of singularities". I tried:
modspec <- paste("win.frac ~", paste("p", 1:49, sep="", collapse=" + "))
fit1 <- lm(as.formula(modspec), data=dat)
fit2 <- lm(win.frac ~ ., data=dat)
Interestingly, the regression works if I use 48 explanatory variables. That set may (p2, ..., p49) or may not (p1, ..., p48) contain p49, so I think the problem is not related to the variable p49 itself. I also tried larger values of n, with the same result.
I also tried betareg from the betareg package, since win.frac is restricted to values between 0 and 1. The regression fails in this case too, with the error message (roughly translated) "error in optim(...): non-finite value of optim specified":
library(betareg)
fit3 <- betareg(as.formula(modspec), data=dat, link="log")
Now I am stuck. How can I perform this regression? Is there a maximum number of variables? Is this problem due to the fact that the explanatory variables are either 0 or 1?
Any hints are much appreciated!
Recommended answer
I assume that those are dummy-encoded factor variables.
If you do the following, you can see that you get a perfect fit when you model one of your regressors with all the others:
regressormod <- lm(p49 ~ . - win.frac, data = dat)
summary(regressormod)$r.sq
#[1] 1
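For the simulated data in the question, the perfect fit can be read off directly from how the data are built: each row receives exactly six ones (from sample(49, 6)), so the indicators sum to a constant. Writing p_{ij} for the value of column p_j in row i (notation introduced here for illustration only):

```latex
\[
\sum_{j=1}^{49} p_{ij} = 6 \quad \text{for every row } i
\qquad\Longrightarrow\qquad
p_{i,49} = 6 \cdot 1 - \sum_{j=1}^{48} p_{ij}.
\]
```

So p49 is an exact linear combination of the intercept column (all ones) and p1, ..., p48, which is why the R-squared above is 1. The same holds for any other p_j, which matches the observation that dropping any single column makes the regression work.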
It is (mathematically) impossible to include the coefficients for all levels of a dummy-encoded factor variable in a regression model that also includes an intercept (see this answer on Cross Validated). That is why R excludes one factor level by default if you let it do the dummy encoding for you.
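As a rough sketch of what this means in practice, the snippet below rebuilds data with the same structure as in the question and compares the model with an intercept against one without it (`. - 1` in the formula). The seed and the names X, fit_full and fit_noint are assumptions added for illustration; they are not part of the original question or answer.

```r
# Sketch only. The seed and the names X, fit_full, fit_noint are illustrative.
set.seed(1)
n <- 4000
tmp <- t(replicate(n, sample(49, 6)))   # six distinct numbers per row

# Build the 0/1 indicator matrix with matrix indexing instead of a double loop
X <- matrix(0, nrow = n, ncol = 49, dimnames = list(NULL, paste0("p", 1:49)))
X[cbind(rep(seq_len(n), times = 6), as.vector(tmp))] <- 1
dat <- as.data.frame(X)
dat$win.frac <- rnorm(n, mean = 0.0176504, sd = 0.002)

# Every row contains exactly six ones, so the indicators sum to a constant
stopifnot(all(rowSums(X) == 6))

# With an intercept there are 50 columns, but one is redundant
fit_full <- lm(win.frac ~ ., data = dat)
fit_full$rank                 # rank of the design matrix
sum(is.na(coef(fit_full)))    # number of coefficients lm() could not estimate

# Without an intercept, no column is redundant any more
fit_noint <- lm(win.frac ~ . - 1, data = dat)
anyNA(coef(fit_noint))

# Both parameterisations describe the same fit
all.equal(fitted(fit_full), fitted(fit_noint))
```

Dropping the intercept or dropping one p_j gives the same fitted values; only the interpretation of the coefficients changes. Note that summary() computes R-squared differently for a model without an intercept, so it is safer to compare fitted values or residuals than R-squared between the two.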