R: lm() result differs when using `weights` argument and when using manually reweighted data


Question

To correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:

#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)

#Weighted Residuals:
#     Min       1Q   Median       3Q      Max 
#-1.83779 -0.33226  0.02011  0.25135  1.48516 

#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -3.939440   0.609991  -6.458 1.62e-09 ***
#q            0.175019   0.070101   2.497 0.013696 *  
#q2           0.048790   0.005613   8.693 8.49e-15 ***
#b            0.473891   0.134918   3.512 0.000598 ***
#c            0.119551   0.125430   0.953 0.342167    
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared:  0.9639,   Adjusted R-squared:  0.9628 
#F-statistic: 933.6 on 4 and 140 DF,  p-value: < 2.2e-16

Here "weighting" is a variable (a function of the variable q) used to weight the observations, and q2 is simply q^2.

Now, to double-check my results, I weight my variables manually by creating new weighted variables:

mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting

I then run the following regression without the weights option and without a constant. Since the constant is weighted too, the column of 1s in the original predictor matrix should now equal the variable weighting:

#Call:
#lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls, 
#    data = mydata)

#Residuals:
#     Min       1Q   Median       3Q      Max 
#-2.38404 -0.55784  0.01922  0.49838  2.62911 

#Coefficients:
#         Estimate Std. Error t value Pr(>|t|)    
#weighting -4.125559   0.579093  -7.124 5.05e-11 ***
#q.wls    0.217722   0.081851   2.660 0.008726 ** 
#q2.wls   0.045664   0.006229   7.330 1.67e-11 ***
#b.wls    0.466207   0.121429   3.839 0.000186 ***
#c.wls    0.133522   0.112641   1.185 0.237876    
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared:  0.9823,   Adjusted R-squared:  0.9817 
#F-statistic:  1556 on 5 and 140 DF,  p-value: < 2.2e-16

As you can see, the results are similar but not identical. Am I doing something wrong when weighting the variables manually, or does the "weights" option do something more than simply multiply the variables by the weighting vector?

Recommended answer

Provided you do the manual weighting correctly, you won't see any discrepancy. The catch is that the data must be multiplied by the square root of the weights, not by the weights themselves.
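A sketch of why the square root appears, using standard weighted least squares notation (the symbols are introduced here for illustration and are not from the original post): let $w_i$ be the weights, $y_i$ the response, and $x_i^\top$ row $i$ of the model matrix, including the intercept column of ones. Passing the weights to lm minimizes the weighted residual sum of squares, which can be rewritten as an ordinary least squares problem:

```latex
\begin{aligned}
\hat\beta_{\mathrm{WLS}}
  &= \arg\min_{\beta} \sum_{i=1}^{n} w_i \,\bigl(y_i - x_i^\top \beta\bigr)^2 \\
  &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(\sqrt{w_i}\,y_i - (\sqrt{w_i}\,x_i)^\top \beta\bigr)^2 .
\end{aligned}
```

The second line is plain least squares on the data scaled by $\sqrt{w_i}$. Scaling by $w_i$ instead corresponds to using weights $w_i^2$, which would explain estimates that are close to, but not the same as, the weighted fit.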

So the correct way to go is:

X <- model.matrix(~ q + q2 + b + c, mydata)  ## non-weighted model matrix (with intercept)
w <- mydata$weighting  ## weights
rw <- sqrt(w)    ## root weights
y <- mydata$a    ## non-weighted response
X_tilde <- rw * X    ## weighted model matrix (with intercept)
y_tilde <- rw * y    ## weighted response

## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)

That said, when passing matrices directly it is generally recommended to use lm.fit and lm.wfit:

matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)

But when using the internal routines lm.fit and lm.wfit, all inputs must be complete cases without NA; otherwise the underlying C routine stats:::C_Cdqrls will raise an error.
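One way to satisfy the complete-cases requirement is to filter rows before building the matrices. The sketch below is an illustration only: it uses the built-in trees data, injects one NA by hand, and the names d, keep, fit and ref are hypothetical.

```r
d <- trees
d$Girth[3] <- NA                    # inject a missing value for illustration
set.seed(0)
d$w <- runif(nrow(d), 1, 2)         # random weights, stored with the data

## keep only rows where response, predictors and weights are all observed
keep <- complete.cases(d)

X <- model.matrix(~ Girth + Volume, d[keep, ])
y <- d$Height[keep]
fit <- lm.wfit(X, y, d$w[keep])

## lm() drops incomplete rows itself via its na.action (na.omit by default),
## so it can serve as a reference for the matrix-based fit
ref <- lm(Height ~ Girth + Volume, data = d, weights = w)
all.equal(unname(fit$coefficients), unname(coef(ref)))
```

Filtering the response, the predictors and the weights with the same logical index keeps the three inputs aligned row by row.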

If you still want to use the formula interface rather than matrices, you can do the following:

## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting

fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)

fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
                 data = mydata)
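Since mydata is not available here, the following sketch repeats the formula-interface recipe on the built-in trees data as a stand-in (an assumption for illustration; the column names rw and *.wls are hypothetical). It also shows a caveat: the coefficients and the residual standard error should agree, but R-squared generally will not, because as far as I can tell summary() centres the response for a model with an intercept and does not for a model fitted with `0 +`.

```r
d <- trees                          # stand-in for mydata
set.seed(0)
d$weighting <- runif(nrow(d), 1, 2)
d$rw <- sqrt(d$weighting)           # root weights

## scale the response and every predictor by the root weights
d$Height.wls <- d$Height * d$rw
d$Girth.wls  <- d$Girth  * d$rw
d$Volume.wls <- d$Volume * d$rw

fit_wls <- lm(Height ~ Girth + Volume, data = d, weights = weighting)
fit_ols <- lm(Height.wls ~ 0 + rw + Girth.wls + Volume.wls, data = d)

## coefficients and residual standard error are compared here
all.equal(unname(coef(fit_wls)), unname(coef(fit_ols)))
all.equal(summary(fit_wls)$sigma, summary(fit_ols)$sigma)

## R-squared is computed differently for the two model specifications
c(wls = summary(fit_wls)$r.squared, ols = summary(fit_ols)$r.squared)
```

So a remaining difference in R-squared or the F-statistic between the two summaries does not by itself mean that the manual weighting is wrong.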


Reproducible example

Let's use R's built-in dataset trees. Use head(trees) to inspect it. There are no NA values in this dataset. We aim to fit the model:

Height ~ Girth + Volume

with some random weights between 1 and 2:

set.seed(0); w <- runif(nrow(trees), 1, 2)

We fit this model via weighted regression, either by passing the weights to lm, or by manually transforming the data and calling lm with no weights:

X <- model.matrix(~ Girth + Volume, trees)  ## non-weighted model matrix (with intercept)
rw <- sqrt(w)    ## root weights
y <- trees$Height    ## non-weighted response
X_tilde <- rw * X    ## weighted model matrix (with intercept)
y_tilde <- rw * y    ## weighted response

fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)

#Coefficients:
#X(Intercept)        XGirth       XVolume  
#     83.2127       -1.8639        0.5843

fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)

#Coefficients:
#X_tilde(Intercept)        X_tildeGirth       X_tildeVolume  
#           83.2127             -1.8639              0.5843

So indeed, we see identical results.
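Rather than comparing printed output by eye, the agreement can be checked programmatically. The sketch below also reproduces the approach from the question (scaling by the weights themselves) to illustrate that it matches a weighted fit with squared weights; the object names Xr, yr, Xw, yw and the fit_* names are hypothetical.

```r
set.seed(0); w <- runif(nrow(trees), 1, 2)
X <- model.matrix(~ Girth + Volume, trees)
y <- trees$Height

Xr <- sqrt(w) * X; yr <- sqrt(w) * y   # scale rows by root weights
Xw <- w * X;       yw <- w * y         # the question's approach: scale by weights

fit_wls  <- lm(y ~ X - 1, weights = w)
fit_root <- lm(yr ~ Xr - 1)
fit_full <- lm(yw ~ Xw - 1)
fit_w2   <- lm(y ~ X - 1, weights = w^2)

## root-weight scaling versus the weighted fit
all.equal(unname(coef(fit_wls)), unname(coef(fit_root)))

## scaling by w versus a weighted fit with weights w^2
all.equal(unname(coef(fit_full)), unname(coef(fit_w2)))

## residuals of the transformed fit versus root-weighted residuals of the weighted fit
all.equal(unname(residuals(fit_root)), unname(sqrt(w) * residuals(fit_wls)))
```

Each comparison should evaluate to TRUE up to numerical tolerance. all.equal is used instead of identical because the two routes perform floating-point operations in a different order.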

Alternatively, we can use lm.fit and lm.wfit:

matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)

We can check the coefficients with:

matfit_by_wls$coefficients
#(Intercept)       Girth      Volume 
# 83.2127455  -1.8639351   0.5843191 

matfit_by_ols$coefficients
#(Intercept)       Girth      Volume 
# 83.2127455  -1.8639351   0.5843191

Again, the results are identical.
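As a further cross-check that does not rely on lm at all, the weighted estimate can be computed from the weighted normal equations (X'WX) beta = X'Wy with W = diag(w). This is a minimal sketch for illustration; beta_hat is a hypothetical name, and solving the normal equations directly is less numerically stable than the QR decomposition that lm uses.

```r
set.seed(0); w <- runif(nrow(trees), 1, 2)
X <- model.matrix(~ Girth + Volume, trees)
y <- trees$Height

## `w * X` scales row i of X by w[i], so crossprod(X, w * X) is X' W X
## and crossprod(X, w * y) is X' W y
beta_hat <- solve(crossprod(X, w * X), crossprod(X, w * y))

all.equal(unname(drop(beta_hat)), unname(lm.wfit(X, y, w)$coefficients))
```

For well-conditioned data such as this, the closed-form solution and lm.wfit should agree to within numerical tolerance.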
