lm function in R does not give coefficients for all factor levels in categorical data

Views: 197 | Published: 2020/4/30 12:20:41 | Tags: r, linear-regression, lm

Question

I was trying out linear regression in R with a categorical predictor, and I noticed that I don't get a coefficient for each of the factor levels I have.

Please see my code below. I have 5 factor levels for states, but I see only 4 state coefficients.

> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     WA        0.5
2     TE        0.2
3     GE        0.6
4     LA        0.7
5     SF        0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)

Call:
lm(formula = population ~ states, data = df)

Coefficients:
(Intercept)     statesLA     statesSF     statesTE     statesWA  
        0.6          0.1          0.3         -0.4         -0.1

I also tried a larger data set by doing the following, but I still see the same behavior:

for(i in 1:10)
{
    df = rbind(df,df)
}

EDIT: Thanks to eipi10, MrFlick and economy for the responses. I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y = m1x1 + m2x2 + ... + c?

I also tried flattening the data so that each factor level gets its own column, but then I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the population value? What do I substitute as its coefficient?

> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
> lm(formula = population ~ (GE+MI+TE+WA),data=df1)

Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)

Coefficients:
(Intercept)           GE           MI           TE           WA  
          1            1            0            1           NA

Recommended answer

GE, the first level alphabetically, is used as the reference level, so it gets no coefficient of its own and is absorbed into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels of states with GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE).
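A minimal sketch of how to check which level is the baseline, assuming the `df` from the question is rebuilt as below with `states` stored explicitly as a factor. `levels()` lists the levels in order (the first one is the reference under R's default treatment contrasts), and `relevel()` picks a different one. The `states_wa` column and the choice of "WA" as the new reference are only for illustration.

```r
# Rebuild the data frame from the question; states is stored explicitly as a factor
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Levels are sorted alphabetically; the first one ("GE") is the reference level
levels(df$states)

fit <- lm(population ~ states, data = df)
coef(fit)  # intercept = value for GE; other coefficients = differences from GE

# Choose a different baseline (here "WA", purely for illustration)
df$states_wa <- relevel(df$states, ref = "WA")
fit_wa <- lm(population ~ states_wa, data = df)
coef(fit_wa)  # intercept = value for WA; a states_waGE coefficient now appears
```

Changing the reference level changes how the coefficients are expressed, but not the fitted values.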

To answer your updated question:

If you include all of the levels in a linear regression together with an intercept, you get a situation called perfect collinearity: the dummy columns for all levels add up to the intercept column, so the coefficients cannot be determined uniquely. That is what produces the strange result when you force each category into its own variable, and it is why lm reports NA for one of them (WA in your df1 example). I won't go into the full explanation here; any reference on multicollinearity covers it. If you want to see all of the levels in a regression, you can fit the model without an intercept term, as suggested in the comments, but this is ill-advised unless you have a specific reason to.
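A rough sketch of both points, assuming the `df` from the question is rebuilt as below. The names `X`, `X_all` and `fit_noint` are made up for this example. It first shows that adding a dummy column for GE on top of the intercept leaves the design matrix one rank short, then fits the no-intercept model with `- 1` in the formula.

```r
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# With an intercept, the design matrix has 5 columns: intercept + 4 dummies
X <- model.matrix(population ~ states, data = df)
colnames(X)

# Adding a dummy for GE as well makes the five dummies sum to the intercept
# column, so the matrix has 6 columns but only rank 5
X_all <- cbind(X, statesGE = as.numeric(df$states == "GE"))
ncol(X_all)
qr(X_all)$rank

# Dropping the intercept lets lm() estimate one coefficient per level
fit_noint <- lm(population ~ states - 1, data = df)
coef(fit_noint)  # one coefficient per state, each equal to that state's mean
```

In the no-intercept fit the coefficients are per-state means rather than differences from a baseline, so they need to be read differently from the default output.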

As for where GE fits in your y = mx + c equation: the other states enter as binary (zero or one) dummy variables, and when the state is GE they are all zero, so the expected y is just the intercept.

For example:

y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c

If you don't have any other variables, as in your first example, the predicted value for GE is simply the intercept term (0.6).
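In practice you rarely need to substitute into the equation by hand. A minimal sketch, assuming the `df` and model from the first example are rebuilt as below, with the hypothetical names `new_data`, `pred` and `by_hand`: `predict()` builds the dummy variables for new rows and applies the coefficients. The manual calculation is included only for comparison.

```r
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)
fit <- lm(population ~ states, data = df)

# New observations; the factor uses the same levels as the training data
new_data <- data.frame(
  states = factor(c("GE", "WA"), levels = levels(df$states))
)

# predict() sets up the 0/1 dummies and applies the fitted coefficients
pred <- predict(fit, newdata = new_data)
pred

# The same values computed by hand from the coefficients:
# GE -> intercept only; WA -> intercept + statesWA
b <- coef(fit)
by_hand <- c(b[["(Intercept)"]], b[["(Intercept)"]] + b[["statesWA"]])
by_hand
```

With the coefficients from the first example, GE gives the intercept (0.6) and WA gives 0.6 - 0.1 = 0.5.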
