R:使用"weights"参数和使用手动重新加权的数据时,lm()结果有所不同 [英] R: lm() result differs when using `weights` argument and when using manually reweighted data
问题描述
为了以错误的方式纠正异方差,我在R中运行以下加权最小二乘回归:
In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
其中加权"是用于加权观测值的变量(变量q
的函数). q2
就是q^2
.
Where "weighting" is a variable (function of the variable q
) used for weighting the observations. q2
is simply q^2
.
现在,要仔细检查结果,我通过创建新的加权变量来手动加权变量:
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
运行以下回归分析,不带权重选项,也没有常数-由于常数是加权的,因此原始预测变量矩阵中的1列现在应等于变量权重:
And run the following regression, without the weights option, and without a constant - since the constant is weighted, the column of 1 in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
如您所见,结果相似但不相同.手动加权变量时我做错了吗?或者权重"选项的作用不只是简单地将变量乘以加权向量?
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
推荐答案
如果您正确进行了手动加权,则不会出现差异.
Provided you do manual weighting correctly, you won't see discrepancy.
所以正确的方法是:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
尽管通常建议直接传递矩阵时使用lm.fit
和lm.wfit
:
Although it is generally recommended to use lm.fit
and lm.wfit
when passing in matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
但是,当使用这些内部子例程lm.fit
和lm.wfit
时,要求所有输入都是没有NA
的完整情况,否则底层C例程stats:::C_Cdqrls
会抱怨.
But when using these internal subroutines lm.fit
and lm.wfit
, it is required that all input are complete cases without NA
, otherwise the underlying C routine stats:::C_Cdqrls
will complain.
如果您仍然想使用公式界面而不是矩阵,则可以执行以下操作:
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
可复制示例
让我们使用R的内置数据集trees
.使用head(trees)
检查此数据集.该数据集中没有NA
.我们旨在拟合模型:
Let's use R's built-in data set trees
. Use head(trees)
to inspect this dataset. There is no NA
in this dataset. We aim to fit a model:
Height ~ Girth + Volume
,随机权重在1到2之间:
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
我们通过加权回归来拟合该模型,方法是将权重传递给lm
,或者手动转换数据并调用不具有任何权重的lm
:
We fit this model via weighted regression, either by passing weights to lm
, or manually transforming data and calling lm
with no weigths:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
所以确实,我们看到了相同的结果.
So indeed, we see identical results.
或者,我们可以使用lm.fit
和lm.wfit
:
Alternatively, we can use lm.fit
and lm.wfit
:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
我们可以通过以下方式检查系数
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
再次,结果是相同的.
这篇关于R:使用"weights"参数和使用手动重新加权的数据时,lm()结果有所不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!