lm function in R does not give coefficients for all factor levels in categorical data
Question
I was trying out linear regression in R with categorical attributes, and I noticed that I don't get a coefficient for each of the factor levels I have.
Please see my code below. I have 5 factor levels for states, but I see only 4 coefficients.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
states population
1 WA 0.5
2 TE 0.2
3 GE 0.6
4 LA 0.7
5 SF 0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)
Call:
lm(formula = population ~ states, data = df)
Coefficients:
(Intercept) statesLA statesSF statesTE statesWA
0.6 0.1 0.3 -0.4 -0.1
I also tried a larger data set by doing the following, but still see the same behavior:
for(i in 1:10)
{
df = rbind(df,df)
}
EDIT: Thanks to the responses from eipi10, MrFlick and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y=m1x1+m2x2+...+c?
I also tried flattening the data so that each factor level gets its own column, but then I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the population value? What do I substitute as its coefficient?
> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
> lm(formula = population ~ (GE+MI+TE+WA),data=df1)
Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)
Coefficients:
(Intercept) GE MI TE WA
1 1 0 1 NA
Recommended answer
GE is the first level alphabetically, so R uses it as the reference level and absorbs it into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels in states with GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE).
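As a rough illustration of that baseline interpretation, the sketch below rebuilds the data frame from the question and compares the fitted coefficients with the per-level means. The names fit, group_means, states_wa and fit_wa are made up for this example; relevel() is the base R function for picking a different reference level.

```r
# Rebuild the data from the question (states stored explicitly as a factor).
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

fit <- lm(population ~ states, data = df)

# Factor levels are sorted alphabetically; the first one is the reference level.
levels(df$states)[1]

# The intercept is the mean of the reference level (GE), and every other
# coefficient is that level's mean minus the GE mean.
group_means <- tapply(df$population, df$states, mean)
coef(fit)["(Intercept)"]                 # same value as group_means["GE"]
coef(fit)["statesLA"]                    # same value as the difference below
group_means["LA"] - group_means["GE"]

# To use another baseline, change the reference level and refit.
df$states_wa <- relevel(df$states, ref = "WA")
fit_wa <- lm(population ~ states_wa, data = df)
coef(fit_wa)                             # the intercept is now the WA mean
```

Changing the reference level only changes how the coefficients are expressed; the fitted values stay the same.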
To answer your updated question:
If you include all of the levels in a linear regression, you're going to have a situation called perfect collinearity, which is responsible for the strange results you're seeing when you force each category into its own variable. I won't get into the full explanation, but the short version is that the dummy columns for all of the levels add up to the intercept column, so the coefficients are no longer uniquely determined and R reports NA for the redundant one. If you want to see all of the levels in a regression, you can perform a regression without an intercept term, as suggested in the comments, but again, this is ill-advised unless you have a specific reason to.
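To make the collinearity concrete, here is a minimal sketch using the data from the question. It builds one 0/1 column per state, shows that those columns together with an intercept column are rank-deficient, and then fits the no-intercept model mentioned above. The names dummies, X_full and fit_no_int are illustrative only.

```r
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# One indicator column per level, built without an intercept column.
dummies <- model.matrix(~ 0 + states, data = df)

# Each row contains exactly one 1, so the indicator columns always sum to 1,
# which is exactly the intercept column.
rowSums(dummies)

# Intercept plus all five indicators: more columns than the matrix has rank.
X_full <- cbind(Intercept = 1, dummies)
ncol(X_full)
qr(X_full)$rank

# Dropping the intercept removes the redundancy. Each coefficient is then
# simply the mean of population for that level.
fit_no_int <- lm(population ~ 0 + states, data = df)
coef(fit_no_int)
```

Note that without an intercept the coefficients are level means rather than differences from a baseline, so the usual significance tests answer a different question (whether each mean differs from zero).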
As for the interpretation of GE in your y=mx+c equation, you can calculate the expected y by knowing that the indicator variables for the other states are binary (zero or one), and if the state is GE, they will all be zero.
For example:
y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c
If you don't have any other variables, like in your first example, the predicted value for GE is simply the intercept term (0.6).
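In practice you rarely need to substitute into the equation by hand, because predict() builds the indicator columns for new data itself. The sketch below assumes the first model from the question; new_data, pred, X_new and manual are names invented for this example.

```r
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)
fit <- lm(population ~ states, data = df)

# New observations: one in the reference level (GE), one in WA.
# Reusing the training levels keeps the dummy coding consistent.
new_data <- data.frame(
  states = factor(c("GE", "WA"), levels = levels(df$states))
)

# predict() does the substitution for you.
pred <- predict(fit, newdata = new_data)
pred

# The same calculation by hand: for GE every indicator is 0, so only the
# intercept remains; for WA the statesWA indicator is 1.
X_new <- model.matrix(~ states, data = new_data)
X_new
manual <- drop(X_new %*% coef(fit))
manual
```

For GE this gives the intercept (0.6), and for WA it gives the intercept plus the statesWA coefficient (0.6 + (-0.1) = 0.5).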