poly() in lm(): difference between raw vs. orthogonal

Problem description

I have

library(ISLR)
attach(Wage)

# Polynomial Regression and Step Functions

fit=lm(wage~poly(age,4),data=Wage)
coef(summary(fit))

fit2=lm(wage~poly(age,4,raw=T),data=Wage)
coef(summary(fit2))

plot(age, wage)
lines(20:350, predict(fit, newdata = data.frame(age=20:350)), lwd=3, col="darkred")
lines(20:350, predict(fit2, newdata = data.frame(age=20:350)), lwd=3, col="darkred")

The prediction lines seem to be the same, so why are the coefficients so different? How do you interpret them with raw=T and raw=F?

I see that the coefficients produced with poly(..., raw=T) match the ones produced with ~age+I(age^2)+I(age^3)+I(age^4).
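
That equivalence can be checked directly (a minimal sketch; fit2 is the raw-coding fit from above and fit3 is just an illustrative name):

# Same model written with I() terms instead of poly(..., raw=T)
fit3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)

# Identical estimates, only the coefficient names differ
all.equal(unname(coef(fit2)), unname(coef(fit3)))
## should return TRUE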

If I want to use the coefficients to get the prediction "manually" (without using the predict() function), is there something I should pay attention to? How should I interpret the coefficients of the orthogonal polynomials in poly()?

Recommended answer

By default, with raw = FALSE, poly() computes an orthogonal polynomial. Internally it first sets up the model matrix with the raw coding x, x^2, x^3, ... and then scales the columns so that each column is orthogonal to the previous ones. This does not change the fitted values, but it has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
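
With the Wage fits from the question this is easy to see: the coefficients differ, but the fitted values agree (a minimal sketch, assuming fit and fit2 from above are still in the workspace):

# Different coefficient vectors ...
coef(fit)
coef(fit2)

# ... but identical fitted values
all.equal(fitted(fit), fitted(fit2))
## should return TRUE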

Consider the simple cars data, with stopping distance dist as the response and driving speed as the regressor. Physically, this should have a quadratic relationship, but in this (old) dataset the quadratic term is not significant:

m1 <- lm(dist ~ poly(speed, 2), data = cars)
m2 <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)

In the orthogonal coding you get the following coefficients in summary(m1):

                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       42.980      2.146  20.026  < 2e-16 ***
poly(speed, 2)1  145.552     15.176   9.591 1.21e-12 ***
poly(speed, 2)2   22.996     15.176   1.515    0.136    

This shows that there is a highly significant linear effect while the second order is not significant. The latter p-value (i.e., the one for the highest order in the polynomial) is the same as in the raw coding:

                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                  2.47014   14.81716   0.167    0.868
poly(speed, 2, raw = TRUE)1  0.91329    2.03422   0.449    0.656
poly(speed, 2, raw = TRUE)2  0.09996    0.06597   1.515    0.136

but the lower-order p-values change dramatically. The reason is that the regressors are orthogonal in model m1, while they are highly correlated in m2:

cor(model.matrix(m1)[, 2], model.matrix(m1)[, 3])
## [1] 4.686464e-17
cor(model.matrix(m2)[, 2], model.matrix(m2)[, 3])
## [1] 0.9794765

Thus, in the raw coding you can only interpret the p-value of speed if speed^2 remains in the model, and since the two regressors are highly correlated, one of them can be dropped. In the orthogonal coding, however, speed^2 only captures the quadratic part that has not been captured by the linear term, and it then becomes clear that the linear part is significant while the quadratic part adds no additional significance.
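
Regarding the "manual" prediction part of the question: with the orthogonal coding you cannot simply plug raw powers of the predictor into the coefficients, because poly() builds its basis from the training data (the centering and scaling are kept in the basis's "coefs" attribute). You have to evaluate the same basis at the new values, for example via the predict() method for poly objects. A minimal sketch using m1 from above (the names basis, newspeed, X and manual are mine):

# Re-evaluate the orthogonal basis used by m1 at new speed values,
# then multiply by the fitted coefficients.
basis    <- poly(cars$speed, 2)                 # same basis as in m1, keeps its coefs
newspeed <- seq(5, 25, by = 5)
X        <- cbind(1, predict(basis, newspeed))  # intercept + orthogonal columns
manual   <- drop(X %*% coef(m1))

all.equal(manual, unname(predict(m1, newdata = data.frame(speed = newspeed))))
## should return TRUE

With raw = TRUE the manual computation is the familiar one: cbind(1, newspeed, newspeed^2) %*% coef(m2) gives the same predictions directly.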
