Why does regression in R delete index 1 of a factor variable?

Problem description

I am trying to do a regression in R using the lm and the glm function.

My dependent variable is logit-transformed data based on the proportion of events over non-events within a given time period. So my dependent variable is continuous, whereas my independent variables are factor variables, or dummies.

I have two independent variables that can take the values of

  • year i through year m, my YEAR variable
  • month j through month n, my MONTH variable

The problem is that whenever I run my model and look at the summary, April (index 1 for MONTH) and 1998 (index 1 for YEAR) are not within the results... and if I change April to, let's say, "foo_bar", August will be missing instead...

Please help! This is frustrating me and I simply do not know how to search for a solution to the problem.

Recommended answer

If R were to create a dummy variable for every level in the factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.
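As a sketch of why this happens, one can use `model.matrix` to inspect the design matrix directly (the toy factor here is made up for illustration):

```r
f <- factor(c('a', 'a', 'b', 'b', 'c', 'c'))

# Keep a dummy column for every level (no level dropped)
full <- model.matrix(~ f, contrasts.arg = list(f = contrasts(f, contrasts = FALSE)))

# The intercept column equals the sum of the three dummy columns,
# so this 4-column matrix only has rank 3 -- it is linearly dependent:
qr(full)$rank
# [1] 3

# R's default treatment coding instead drops the first level ('a'):
colnames(model.matrix(~ f))
# [1] "(Intercept)" "fb"          "fc"
```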

To illustrate this, let's consider a toy example:

> data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
> summary(lm(y ~ f, data))

Call:
lm(formula = y ~ f, data = data)

Residuals:
   1    2    3    4    5    6 
-0.5  0.5 -1.0  1.0 -7.0  7.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    2.500      4.093   0.611   0.5845  
fb             3.500      5.788   0.605   0.5880  
fc            15.500      5.788   2.678   0.0752 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.788 on 3 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.5409 
F-statistic: 3.945 on 2 and 3 DF,  p-value: 0.1446 

As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of the data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts for the two other factor levels. Thus the intercept for b is 6 (2.500 + 3.500) and the intercept for c is 18 (2.500 + 15.500).
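To check this reading of the output, one can compare the fitted coefficients against the group means directly (re-entering the toy data from above):

```r
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = factor(c('a', 'a', 'b', 'b', 'c', 'c')))
fit <- lm(y ~ f, data)

coef(fit)
# (Intercept)          fb          fc
#         2.5         3.5        15.5

# Baseline plus offset recovers each group's mean:
tapply(data$y, data$f, mean)
#    a    b    c
#  2.5  6.0 18.0
```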

If you don't like the automatic choice, you could pick another level as the baseline: How to force R to use a specified factor level as reference in a regression?
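For completeness, a minimal sketch of that approach using `relevel`, applied to the toy data above:

```r
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = factor(c('a', 'a', 'b', 'b', 'c', 'c')))

# Make 'c' the reference level instead of the alphabetically first 'a'
data$f <- relevel(data$f, ref = 'c')
coef(lm(y ~ f, data))
# (Intercept)          fa          fb
#        18.0       -15.5       -12.0
```

The intercept is now the mean for group c (18), and fa and fb are offsets from it.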
