Linear model with categorical variables in R


Question

I am trying to fit a linear model with some categorical variables:

model <- lm(price ~ carat+cut+color+clarity)
summary(model)

The output is:

Call:
lm(formula = price ~ carat + cut + color + clarity)

Residuals:
     Min       1Q   Median       3Q      Max 
-11495.7   -688.5   -204.1    458.2   9305.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994    
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165    
color^5         2.867     44.586   0.064 0.948729    
color^6        50.531     40.771   1.239 0.215268    
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared:  0.9162,    Adjusted R-squared:  0.9159 
F-statistic:  2817 on 18 and 4639 DF,  p-value: < 2.2e-16

But I don't understand why the coefficients are labeled ".L", ".Q", ".C", "^4", and so on. Something seems to be wrong, but I don't know what. I have already tried applying the function factor to each variable.

Recommended answer

What you are seeing is how regression functions handle "ordered" (ordinal) factor variables: their default contrasts are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of the factor. That is why the coefficients are labeled .L (linear), .Q (quadratic), .C (cubic), ^4, and so on. The result is not easy to interpret, especially if there is no natural order. Even if there is one, and there may well be in this case, you might not want the default ordering (which is alphabetical by factor level), and you probably don't want more than a few degrees in the polynomial contrasts.
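As a minimal sketch of where those labels come from, the snippet below builds a small ordered factor and inspects its default contrast matrix. The vector `x` and its values are made up for illustration; only base R functions are used.

```r
# Hypothetical toy factor with the same five levels as diamonds$cut
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
x <- factor(c("Good", "Ideal", "Fair", "Premium", "Very Good"),
            levels = cut_levels, ordered = TRUE)

# Session-wide defaults: one contrast function for unordered factors,
# another for ordered ones
getOption("contrasts")

# For an ordered factor the default matrix comes from contr.poly():
# one column per polynomial degree, up to n - 1 = 4
poly_contrasts <- contrasts(x)
print(round(poly_contrasts, 3))

# The column names are what lm() appends to the variable name
print(colnames(poly_contrasts))
```

Each column is one polynomial term across the ordered levels, so a coefficient such as cut.Q measures the quadratic component of the trend rather than the effect of any single level.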

In the case of ggplot2's diamonds dataset, the factor levels are set up correctly, but most newcomers who stumble across ordered factors end up with alphabetically ordered levels such as "Excellent" < "Fair" < "Good" < "Poor", which is wrong.

> levels(diamonds$cut)
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
> levels(diamonds$clarity)
[1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"  
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
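The alphabetical-ordering pitfall mentioned above can be reproduced with a few made-up ratings. This is only an illustrative sketch; the vector `quality` and its labels are hypothetical and not part of the diamonds data.

```r
# Hypothetical ratings, deliberately not in quality order
quality <- c("Poor", "Good", "Excellent", "Fair")

# Without explicit levels, factor() sorts the labels alphabetically,
# so the implied order is Excellent < Fair < Good < Poor
wrong <- factor(quality, ordered = TRUE)
print(levels(wrong))

# Passing levels explicitly gives the intended order
right <- factor(quality,
                levels = c("Poor", "Fair", "Good", "Excellent"),
                ordered = TRUE)
print(levels(right))

# Comparisons follow the declared order: is "Poor" below "Good"?
print(right[1] < right[2])
```

Checking levels() before fitting, as done for the diamonds factors above, is a cheap way to catch this.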

One method for using ordered factors once they have been set up correctly is to wrap them in as.numeric, which gives you a linear test of trend.

> contrasts(diamonds$cut) <- contr.treatment(5) # Use treatment (dummy) contrasts for cut instead of polynomial ones
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)

Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity), 
    data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-19130.3   -696.1   -176.8    556.9   9599.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -5189.460     36.577 -141.88   <2e-16 ***
carat                8791.452     12.659  694.46   <2e-16 ***
cut2                  909.433     35.346   25.73   <2e-16 ***
cut3                 1129.518     32.772   34.47   <2e-16 ***
cut4                 1156.989     32.427   35.68   <2e-16 ***
cut5                 1264.128     32.160   39.31   <2e-16 ***
as.numeric(color)    -318.518      3.282  -97.05   <2e-16 ***
as.numeric(clarity)   522.198      3.521  148.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared:  0.9054,    Adjusted R-squared:  0.9054 
F-statistic: 7.371e+04 on 7 and 53932 DF,  p-value: < 2.2e-16
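The two codings used in that model can also be tried on simulated data without ggplot2. The sketch below is an assumption-laden illustration: the data frame `toy`, its coefficients, and the column `cut_unordered` are invented here, and dropping the ordered class with factor(..., ordered = FALSE) is an alternative to assigning contr.treatment, not the method used above.

```r
set.seed(1)

# Hypothetical toy data: 40 observations per cut level
lv <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy <- data.frame(
  carat = runif(200, 0.2, 2),
  cut   = factor(rep(lv, times = 40), levels = lv, ordered = TRUE)
)
# Made-up price that rises linearly with carat and with the cut level code
toy$price <- 4000 * toy$carat + 300 * as.numeric(toy$cut) +
  rnorm(200, sd = 200)

# as.numeric() maps the levels to the integer codes 1..5, so the model
# estimates a single slope per one-level step (linear trend)
fit_trend <- lm(price ~ carat + as.numeric(cut), data = toy)

# Dropping the "ordered" class keeps the level order but makes lm()
# fall back to treatment (dummy) coding with "Fair" as the baseline
toy$cut_unordered <- factor(toy$cut, ordered = FALSE)
fit_dummy <- lm(price ~ carat + cut_unordered, data = toy)

print(names(coef(fit_trend)))
print(names(coef(fit_dummy)))
```

The trend model spends one degree of freedom on cut, while the dummy-coded model spends four and makes no assumption about equal spacing between levels.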
