Linear model with categorical variables in R
Question
I am trying to fit a linear model with some categorical variables:
model <- lm(price ~ carat+cut+color+clarity)
summary(model)
The output is:
Call:
lm(formula = price ~ carat + cut + color + clarity)
Residuals:
Min 1Q Median 3Q Max
-11495.7 -688.5 -204.1 458.2 9305.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3696.818 47.948 -77.100 < 2e-16 ***
carat 8843.877 40.885 216.311 < 2e-16 ***
cut.L 755.474 68.378 11.049 < 2e-16 ***
cut.Q -349.587 60.432 -5.785 7.74e-09 ***
cut.C 200.008 52.260 3.827 0.000131 ***
cut^4 12.748 42.642 0.299 0.764994
color.L 1905.109 61.050 31.206 < 2e-16 ***
color.Q -675.265 56.056 -12.046 < 2e-16 ***
color.C 197.903 51.932 3.811 0.000140 ***
color^4 71.054 46.940 1.514 0.130165
color^5 2.867 44.586 0.064 0.948729
color^6 50.531 40.771 1.239 0.215268
clarity.L 4045.728 108.363 37.335 < 2e-16 ***
clarity.Q -1545.178 102.668 -15.050 < 2e-16 ***
clarity.C 999.911 88.301 11.324 < 2e-16 ***
clarity^4 -665.130 66.212 -10.045 < 2e-16 ***
clarity^5 920.987 55.012 16.742 < 2e-16 ***
clarity^6 -712.168 52.346 -13.605 < 2e-16 ***
clarity^7 1008.604 45.842 22.002 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared: 0.9162, Adjusted R-squared: 0.9159
F-statistic: 2817 on 18 and 4639 DF, p-value: < 2.2e-16
But I don't understand why the coefficients are labeled ".L", ".Q", ".C", "^4", and so on. Something seems wrong, but I don't know what. I have already tried applying factor() to each variable.
Recommended answer
What you are seeing is how regression functions handle "ordered" (ordinal) factor variables: the default contrasts for them are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of the factor. The suffixes .L, .Q, .C, ^4, ... label the linear, quadratic, cubic, and higher-degree terms. That result is not going to be very easy to interpret, especially if there is no natural order. Even if there is one, and there might well be in this case, you might not want the default ordering (which is alphabetical by factor level), and you probably don't want more than a few degrees in the polynomial contrasts.
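To see where those labels come from, here is a minimal sketch using only base R. The factor `size` and its levels are made up for illustration; they are not part of the question's data.

```r
# Hypothetical ordered factor with 5 levels (illustration only).
size <- factor(c("S", "M", "L", "XL", "XXL"),
               levels = c("S", "M", "L", "XL", "XXL"),
               ordered = TRUE)

# With R's default options, ordered factors get polynomial contrasts.
P <- contrasts(size)
colnames(P)            # ".L" ".Q" ".C" "^4": linear, quadratic, cubic, 4th degree

# contr.poly(n) builds the same n x (n-1) matrix directly.
Q <- contr.poly(5)

# The columns are orthonormal, so t(Q) %*% Q is the identity matrix.
round(crossprod(Q), 10)
```

lm() pastes these column names onto the variable name, which is how a 5-level factor such as cut ends up as cut.L, cut.Q, cut.C and cut^4.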
In the case of ggplot2's diamonds dataset, the factor levels are set up correctly, but most newcomers who stumble across ordered factors end up with alphabetically ordered levels like "Excellent" < "Fair" < "Good" < "Poor". (Fail)
> levels(diamonds$cut)
[1] "Fair" "Good" "Very Good" "Premium" "Ideal"
> levels(diamonds$clarity)
[1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
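The alphabetical-ordering trap mentioned above can be reproduced with a small sketch. The vector `quality` is a made-up example, not a column of diamonds.

```r
# Hypothetical ratings (illustration only).
quality <- c("Poor", "Fair", "Good", "Excellent", "Good", "Fair")

# Without an explicit `levels` argument, the levels are sorted alphabetically,
# so the implied order is Excellent < Fair < Good < Poor.
bad <- factor(quality, ordered = TRUE)
levels(bad)

# Spelling out the levels gives the intended order Poor < Fair < Good < Excellent.
good <- factor(quality,
               levels = c("Poor", "Fair", "Good", "Excellent"),
               ordered = TRUE)
levels(good)

# Comparisons now behave as expected for an ordinal variable.
good[1] < good[4]      # Poor < Excellent
```

Checking levels() before fitting, as was done for diamonds above, is a cheap way to catch this.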
One method of using ordered factors, once they have been set up correctly, is to just wrap them in as.numeric, which gives you a linear test of trend:
> contrasts(diamonds$cut) <- contr.treatment(5) # Use treatment (dummy) contrasts for cut instead of polynomial ones
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)
Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity),
data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-19130.3 -696.1 -176.8 556.9 9599.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5189.460 36.577 -141.88 <2e-16 ***
carat 8791.452 12.659 694.46 <2e-16 ***
cut2 909.433 35.346 25.73 <2e-16 ***
cut3 1129.518 32.772 34.47 <2e-16 ***
cut4 1156.989 32.427 35.68 <2e-16 ***
cut5 1264.128 32.160 39.31 <2e-16 ***
as.numeric(color) -318.518 3.282 -97.05 <2e-16 ***
as.numeric(clarity) 522.198 3.521 148.31 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared: 0.9054, Adjusted R-squared: 0.9054
F-statistic: 7.371e+04 on 7 and 53932 DF, p-value: < 2.2e-16
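As an alternative to reassigning contrasts, an ordered factor can be copied into an unordered one, which then gets ordinary dummy coefficients. The sketch below uses simulated data so that it runs without ggplot2. The data frame `toy` and the columns `y`, `grade`, and `grade_u` are hypothetical names, not part of the answer above.

```r
set.seed(42)

# Simulated data: an ordered factor with 3 levels and a noisy linear response.
grade <- factor(rep(c("low", "mid", "high"), times = 20),
                levels = c("low", "mid", "high"),
                ordered = TRUE)
toy <- data.frame(grade = grade,
                  y = 2 * as.numeric(grade) + rnorm(60))

# 1. Ordered factor: polynomial contrasts, coefficients grade.L and grade.Q.
fit_poly <- lm(y ~ grade, data = toy)
names(coef(fit_poly))

# 2. Unordered copy: treatment contrasts, one coefficient per non-baseline level.
toy$grade_u <- factor(toy$grade, ordered = FALSE)
fit_dummy <- lm(y ~ grade_u, data = toy)
names(coef(fit_dummy))

# Both are the same model in different parameterizations.
all.equal(fitted(fit_poly), fitted(fit_dummy))

# 3. Integer codes 1, 2, 3: a single slope, i.e. a linear test of trend.
fit_trend <- lm(y ~ as.numeric(grade), data = toy)
coef(fit_trend)
```

The first two fits differ only in how the coefficients are labeled and interpreted. The third is a more restrictive model, since it assumes equal steps between adjacent levels.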