How to specify covariates in a regression model

Problem description

The dataset I would like to analyse looks like this:

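n <- 4000
# Each row of tmp holds 6 distinct numbers drawn from 1..49
tmp <- t(replicate(n, sample(49,6)))
# 0/1 indicator columns p1..p49, initially all 0
dat <- matrix(0, nrow=n, ncol=49)
colnames(dat) <- paste("p", 1:49, sep="")
dat <- as.data.frame(dat)
dat[, "win.frac"] <- rnorm(n, mean=0.0176504, sd=0.002)
# Set the indicators of the 6 drawn numbers to 1 in each row
for (i in 1:nrow(dat)) 
  for (j in 1:6) dat[i, paste("p", tmp[i, j], sep="")] <- 1
str(dat)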

Now I would like to perform a regression with win.frac as the dependent variable and all other variables (p1, ..., p49) as explanatory variables.

However, with every approach I tried, the coefficient for p49 comes out as NA, with the message "1 not defined because of singularities". I tried:

modspec <- paste("win.frac ~", paste("p", 1:49, sep="", collapse=" + "))
fit1 <- lm(as.formula(modspec), data=dat)
fit2 <- lm(win.frac ~ ., data=dat)

Interestingly, the regression works if I use only 48 explanatory variables. It works whether the 48 include p49 (p2, ..., p49) or not (p1, ..., p48), so I think the problem is not related to the variable p49 itself. I also tried larger values of n, with the same result.

I also tried betareg from the betareg package, since win.frac is restricted to values between 0 and 1. This regression fails too, with the error message (roughly translated) "error in optim(...): non-finite value of optim specified":

library(betareg)
fit3 <- betareg(as.formula(modspec), data=dat, link="log")

Now I am stuck. How can I perform this regression? Is there a maximum number of variables? Is the problem due to the fact that the explanatory variables are all either 0 or 1?

Any hints are much appreciated!

Recommended answer

I assume that those are dummy-encoded factor variables.

If you do the following, you can see that you get a perfect fit when you model one of your regressors with all the others:

regressormod <- lm(p49 ~ . - win.frac, data = dat)
summary(regressormod)$r.sq
#[1] 1
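
One way to see where that perfect fit comes from, assuming the data are generated as in the question (six distinct numbers out of 49 per row): every row of the indicator columns sums to 6, so the 49 columns add up to 6 times the intercept column. Strictly speaking this is not a single dummy-coded factor (each row has six 1s rather than one), but the same kind of linear dependence arises. The sketch below rebuilds a smaller version of the data; the names `ind` and `X`, the seed, and the reduced `n` are illustrative choices, not part of the original post.

```r
set.seed(1)  # illustrative seed, not from the original post
n <- 200     # smaller than the question's 4000; enough to show the effect
tmp <- t(replicate(n, sample(49, 6)))

# Build the 0/1 indicator matrix with columns p1..p49
ind <- matrix(0, nrow = n, ncol = 49,
              dimnames = list(NULL, paste0("p", 1:49)))
for (i in seq_len(n)) ind[i, tmp[i, ]] <- 1

# Every row contains exactly six 1s
unique(rowSums(ind))
#[1] 6

# Design matrix with an intercept column: 50 columns in total
X <- cbind(1, ind)
ncol(X)
#[1] 50

# The numerical rank is lower than the number of columns,
# so one coefficient cannot be estimated
qr(X)$rank
```

Because the intercept column equals the sum of p1, ..., p49 divided by 6, any one of the 50 columns can be written in terms of the other 49, which is why it does not matter which single indicator is left out.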

It is (mathematically) impossible to estimate all coefficients of dummy-encoded factor variables in a regression model that also includes an intercept (see this answer on Cross Validated). That is why R excludes one factor level by default if you let it do the dummy encoding for you.
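
As a rough sketch of what this means in practice: the first part shows R's default coding of a small made-up factor `f`, and the second part shows two common ways to get a model without NA coefficients for data like the question's, namely dropping the intercept or dropping one indicator. The seed, the reduced `n`, and the object names `ind`, `fit_full`, `fit_noint` and `fit_drop` are illustrative assumptions, not part of the original post.

```r
# R's own dummy coding drops the first level when an intercept is present
f <- factor(c("a", "b", "c", "a"))  # hypothetical example factor
colnames(model.matrix(~ f))      # "(Intercept)" "fb" "fc"
colnames(model.matrix(~ f - 1))  # "fa" "fb" "fc"

# Rebuild a smaller version of the question's data
set.seed(1)  # illustrative seed
n <- 500
tmp <- t(replicate(n, sample(49, 6)))
ind <- matrix(0, nrow = n, ncol = 49,
              dimnames = list(NULL, paste0("p", 1:49)))
for (i in seq_len(n)) ind[i, tmp[i, ]] <- 1
dat <- as.data.frame(ind)
dat$win.frac <- rnorm(n, mean = 0.0176504, sd = 0.002)

# Original specification: intercept plus all 49 indicators
fit_full <- lm(win.frac ~ ., data = dat)
sum(is.na(coef(fit_full)))   # number of coefficients lm could not estimate

# Option 1: drop the intercept and keep all 49 indicators
fit_noint <- lm(win.frac ~ . - 1, data = dat)
anyNA(coef(fit_noint))

# Option 2: keep the intercept and drop one indicator (here p49)
fit_drop <- lm(win.frac ~ . - p49, data = dat)
anyNA(coef(fit_drop))

# The three specifications span the same column space,
# so their fitted values agree
all.equal(fitted(fit_full), fitted(fit_noint))
all.equal(fitted(fit_full), fitted(fit_drop))
```

The fitted values are the same in all three cases; only the interpretation of the individual coefficients changes, so the choice between the options depends on which parameterisation is easier to read.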
