Interpreting estimates of categorical predictors in linear regression


Question

I'm new to linear regression and I'm trying to figure out how to interpret the summary results. I'm having difficulty interpreting the estimates of categorical predictors. Consider the following example. I added the columns age and length to include a numeric predictor and a numeric target.

library(MASS)
data <- as.data.frame(HairEyeColor)

# add a numeric target (length) and a numeric predictor (age) to the data frame
data$length <- c(155, 173, 172, 176, 186, 188, 160, 154, 192, 192, 185, 150, 181, 195, 161, 194,
                 173, 185, 185, 195, 168, 158, 151, 170, 163, 156, 186, 173, 167, 172, 164, 182)
data$age <- c(48, 44, 8, 23, 23, 63, 64, 26, 8, 56, 40, 11, 17, 12, 60, 10, 9, 21, 46, 7, 12, 9, 32, 37, 52, 64, 36, 31, 41, 24)

summary(lm(length ~ Hair + Eye + Sex + age, data))

Output:

         Estimate Std. Error t value Pr(>|t|)    
(Intercept) 182.72906    8.22026  22.229   <2e-16 ***
HairBrown     6.22998    7.45423   0.836    0.412    
HairRed      -0.38261    7.50570  -0.051    0.960    
HairBlond    -0.25860    7.36012  -0.035    0.972    
EyeBlue      -8.44369    7.36646  -1.146    0.263    
EyeHazel      0.06968    7.49589   0.009    0.993    
EyeGreen     -0.15554    7.27704  -0.021    0.983    
SexFemale    -4.92415    5.18308  -0.950    0.352    
age          -0.19084    0.15910  -1.200    0.243

Most of these aren't significant, but let's ignore that for now.

  1. What is there to say about (Intercept)? Intuitively, I'd say this is the value for length when the baseline values of the categorical predictors apply (Hair = Black, Eye = Brown, Sex = Male) and age = 0. Is this correct?

  2. The mean value of length in the dataset is 173.8125, yet the estimate is 182.72906. Does that imply that, for the baseline situation, the estimated length is actually higher than the average length?

  3. A similar question to question 2: let's say Eye = Blue and all other values remain at the baseline. The estimate then becomes 174.285 (182.72906 - 8.44369). Can I infer from this that the expected average length is then 174.285, and thus still higher than the overall average (173.8125)?
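(As a side note, this arithmetic can be reproduced directly from the fitted coefficients; a minimal sketch, assuming the model from the summary above is stored in a variable called fit:)

fit <- lm(length ~ Hair + Eye + Sex + age, data)
# baseline prediction at age = 0, shifted by the Eye = Blue dummy
unname(coef(fit)["(Intercept)"] + coef(fit)["EyeBlue"])   # 182.72906 - 8.44369 = 174.285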

  4. How can I discover which predictor/value has a positive or negative effect on length? Simply taking the direction of the estimate won't work: a negative estimate only means a negative impact compared to the baseline. Does this mean I can only infer that, for example, Eye = Blue has a negative impact compared to Eye = Brown, rather than that it has a negative impact in general?

  5. How come (Intercept) is significant while all the other rows aren't? What does the significance of the intercept stand for?

  6. When running the model with only Hair as a predictor, the direction of Hair = Blond becomes positive (see below), while it is negative in the previous model. Is it then wiser to run the model separately for each predictor, so that I can capture the true size and direction of an individual predictor?

    summary(lm(length ~ Hair, data))

                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  173.125      5.107  33.900   <2e-16 ***
    HairBrown      4.250      7.222   0.588    0.561
    HairRed       -2.625      7.222  -0.363    0.719
    HairBlond      1.125      7.222   0.156    0.877

Thank you for your help.

Answer

  1. Yes. The dummy variables are created by contrast coding, so your intercept is indeed the prediction for the baseline values.
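A quick way to see this coding in R (a short sketch, not part of the original answer; it just inspects the factors created above):

# With R's default treatment contrasts the first factor level is the baseline,
# and each dummy column compares one level against that baseline.
levels(data$Hair)      # "Black" "Brown" "Red" "Blond" -> Black is the baseline
contrasts(data$Hair)   # dummy columns HairBrown, HairRed, HairBlond
contrasts(data$Eye)    # baseline is Brown
contrasts(data$Sex)    # baseline is Male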

  2. Again, as explained in point 1.

  3. Yes, you can conclude that, but the difference is small. You should check whether the average falls within the confidence interval or not. If it does, then the difference between the average and the value for Blue isn't significant for practical purposes.
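One way to do that check in R (a minimal sketch, assuming the model is refitted and stored as fit; the answer itself doesn't give code):

fit <- lm(length ~ Hair + Eye + Sex + age, data)

# Predicted mean length for the baseline categories with Eye = Blue at age = 0,
# together with a 95% confidence interval for that mean.
new_obs <- data.frame(Hair = "Black", Eye = "Blue", Sex = "Male", age = 0)
predict(fit, newdata = new_obs, interval = "confidence")

mean(data$length)   # 173.8125 -- check whether it falls inside the interval above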

  4. Since these are all dummy variables, you can infer that a positive estimate indicates a positive impact and vice versa. However, to be more precise, take a look at the confidence intervals: only if both the lower and upper bounds are positive can you say with confidence that the variable has a positive impact; otherwise the direction is uncertain.
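In R those intervals come straight from confint() (a sketch, using the same fit as in the previous snippet):

confint(fit, level = 0.95)
# A coefficient's sign is only trustworthy when both bounds lie on the same
# side of zero, e.g. both positive for a clearly positive effect.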

  5. Since your data doesn't give the model any information about what happens when all variables are zero, the model has too few observations to make a meaningful prediction about the intercept: your dummy variables are never all zero at any point in the data.

  6. Yes, you can do that, but it will mostly give you only the direction, provided the confidence intervals don't include zero.

If I were you, I'd choose a different model, such as regression trees, which are known to work well with categorical variables.
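A minimal sketch of that suggestion using the rpart package (the package choice and the control settings below are assumptions, not something the answer specifies):

# One possible regression tree on the same data; with only 32 rows the tree
# will be very small, so this is purely illustrative.
library(rpart)
tree <- rpart(length ~ Hair + Eye + Sex + age, data = data,
              method = "anova",                              # regression tree (continuous target)
              control = rpart.control(minsplit = 5, cp = 0.01))
print(tree)
# plot(tree); text(tree)   # quick look at the splits, if any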

