Correcting dfs when using sample weights with lm


Question

I was trying to figure out how weighting in lm actually works, and I came across this 7.5-year-old question, which gives some insight into how weights work. The data from that question is partly copied and expanded on below.

I also posted a related question on Cross Validated.

library(plyr)
set.seed(100)
df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)

set.seed(100)
df.double_weights <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=2*df$weight,
                      stringsAsFactors=FALSE)

df.expand <- ddply(df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})

df.lm <- lm(bp~age,data=df,weights=weight)
df.double_weights.lm <- lm(bp~age,data=df.double_weights,weights=weight)
df.expand.lm <- lm(bp~age,data=df.expand)

summary(df.lm)
summary(df.double_weights.lm)
summary(df.expand.lm)

These three data frames contain exactly the same data. However:

In df, there are 200 observations whose weights add up to 1178: sum(df$weight) == 1178.

In df.double_weights, the weights are simply doubled: sum(df.double_weights$weight) == 2356.

In df.expand, instead of 200 weighted observations, there are 1178 unweighted observations.

The coefficients for summary(df.lm) and summary(df.double_weights.lm) are the same, and so is the significance (which means that, IF THE WEIGHTING WORKS PROPERLY, the absolute size of the weights is irrelevant). It seems, however, that the absolute size does matter; see the result at the bottom.
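The observation that doubling the weights leaves lm's output unchanged can be sketched algebraically. The notation here is an assumption introduced for illustration: $W = \mathrm{diag}(w_1, \dots, w_n)$ holds the weights, $e_i$ are the residuals, $n$ is the number of rows and $p$ the number of coefficients.

```latex
\begin{aligned}
\hat\beta &= (X^\top W X)^{-1} X^\top W y
  && \text{unchanged if } W \to cW \\
\hat\sigma^2 &= \frac{\sum_i w_i e_i^2}{n - p}
  && \to\; c\,\hat\sigma^2 \\
\widehat{\mathrm{Var}}(\hat\beta) &= \hat\sigma^2 \,(X^\top W X)^{-1}
  && \to\; c\,\hat\sigma^2 \cdot \tfrac{1}{c}\,(X^\top W X)^{-1}
     = \hat\sigma^2 \,(X^\top W X)^{-1}
\end{aligned}
```

Because the divisor $n - p$ counts rows (200 - 2 = 198 here) rather than the sum of the weights, any constant factor $c$ cancels. The expanded data set gives the same $\hat\beta$ and the same weighted residual sum of squares, but its divisor is 1178 - 2 = 1176, which is where the standard errors start to differ.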

For summary(df.lm) and summary(df.expand.lm), however, the coefficients are the same but the significance differs.

summary(df.lm)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 165.6545    10.3850  15.951   <2e-16 ***
age          -0.2852     0.2132  -1.338    0.183    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 98.84 on 198 degrees of freedom
Multiple R-squared:  0.008956,  Adjusted R-squared:  0.003951 
F-statistic: 1.789 on 1 and 198 DF,  p-value: 0.1825

summary(df.expand.lm)

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 165.65446    4.26123   38.88  < 2e-16 ***
age          -0.28524    0.08749   -3.26  0.00115 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.68 on 1176 degrees of freedom
Multiple R-squared:  0.008956,  Adjusted R-squared:  0.008114 
F-statistic: 10.63 on 1 and 1176 DF,  p-value: 0.001146

According to @IRTFM, the degrees of freedom are not being added up properly, and the following code fixes that:

df.lm.aov <- anova(df.lm)
df.lm.aov$Df[length(df.lm.aov$Df)] <- 
        sum(df.lm$weights)-   
        sum(df.lm.aov$Df[-length(df.lm.aov$Df)]  ) -1
df.lm.aov$`Mean Sq` <- df.lm.aov$`Sum Sq`/df.lm.aov$Df
df.lm.aov$`F value`[1] <- df.lm.aov$`Mean Sq`[1]/
                                        df.lm.aov$`Mean Sq`[2]
df.lm.aov$`Pr(>F)`[1] <- pf(df.lm.aov$`F value`[1], 1, 
                                      df.lm.aov$Df, lower.tail=FALSE)[2]
df.lm.aov

Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4                    
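One way to read the correction above: the weighted fit already has the right weighted residual sum of squares, and only the divisor (the residual degrees of freedom) is off. The same idea can be applied to the coefficient table. This is a minimal sketch, assuming the weights are integer frequency counts; the helper name `freq_weight_coefs` and the small data set are hypothetical, made up for illustration.

```r
# Hypothetical helper: recompute the coefficient table of a weighted lm fit,
# assuming the weights are integer frequency counts.
freq_weight_coefs <- function(fit) {
  w      <- weights(fit)           # prior weights passed to lm()
  p      <- length(coef(fit))      # number of estimated coefficients
  df_old <- df.residual(fit)       # number of rows minus p
  df_new <- sum(w) - p             # implied number of observations minus p
  est    <- coef(fit)
  # vcov() divides the weighted RSS by df_old; rescale so it is divided by df_new
  se     <- sqrt(diag(vcov(fit))) * sqrt(df_old / df_new)
  tval   <- est / se
  pval   <- 2 * pt(abs(tval), df = df_new, lower.tail = FALSE)
  cbind(Estimate = est, `Std. Error` = se, `t value` = tval, `Pr(>|t|)` = pval)
}

# Made-up data: each row stands for `n` identical observations
d    <- data.frame(x = c(1, 2, 3, 4, 5),
                   y = c(2.1, 3.9, 6.2, 7.8, 10.1),
                   n = c(3, 1, 4, 2, 5))
fit  <- lm(y ~ x, data = d, weights = n)
full <- lm(y ~ x, data = d[rep(seq_len(nrow(d)), d$n), ])  # expanded rows

corrected <- freq_weight_coefs(fit)
print(corrected)
print(coef(summary(full)))
```

The two printed tables should agree, which suggests the correction is nothing more than swapping the row count for the sum of the weights in the variance estimate. It is only meaningful when the weights really are counts of identical observations.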

Now, almost 8 years later, this problem apparently still persists. (Does this not mean that almost all research that used weighted variables in combination with lm in R has significance values that are too low?) More practically, my problem is that I hardly understand what IRTFM's code is doing, or how it relates to multiple regression analysis (or even to other functions that use lm under the hood).

If we run IRTFM's solution on df.double_weights.lm, we get a different result, so apparently the absolute size of the weights DOES matter.

Analysis of Variance Table

Response: bp
            Df  Sum Sq Mean Sq F value    Pr(>F)    
age          1   17481 17481.0  21.274 4.194e-06 ***
Residuals 2354 1934293   821.7                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
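The difference between the two corrected ANOVA tables follows from how the corrected F statistic is built. As a sketch, with $SS_{\text{age}}$ and $RSS$ denoting the weighted sums of squares and $p = 2$ the number of coefficients (notation assumed for illustration):

```latex
F = \frac{SS_{\text{age}} / 1}{RSS / \left(\sum_i w_i - p\right)},
\qquad
w_i \to c\,w_i:\quad
F_c = \frac{c\,SS_{\text{age}}}{c\,RSS / \left(c \sum_i w_i - p\right)}
    = F \cdot \frac{c \sum_i w_i - p}{\sum_i w_i - p}
```

With $c = 2$ and $\sum_i w_i = 1178$, the ratio is $2354 / 1176 \approx 2.0017$, and $10.628 \times 2.0017 \approx 21.274$, matching the second table. Under this correction the weights are treated as counts, so doubling them amounts to claiming twice as many observations.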

Recommended answer

If I understand your question correctly, what you have in your weights column is often called "frequency weights". They are used to save space in a dataset by indicating how many observations you have for each combination of covariates.

To estimate a model with an "aggregated" dataset and obtain correct standard errors, all you need to do is correct the number of degrees of freedom in your lm model.

The correct number of degrees of freedom is the total number of observations minus the number of parameters in your model. You can get the total number of observations either by summing your weights variable or by counting the rows of the "full" data; then subtract the number of estimated parameters (i.e., coefficients).

Here's a simpler example, which I think makes the point clearer:

library(dplyr)
library(modelsummary)

set.seed(1024)

# individual (true) dataset
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset
agg <- ind %>%
  group_by(x, y) %>%
  summarize(freq = n())

models <- list( 
  "True"                = lm(y ~ x, data = ind),
  "Aggregated"          = lm(y ~ x, data = agg),
  "Aggregated & W"      = lm(y ~ x, data = agg, weights=freq),
  "Aggregated & W & DF" = lm(y ~ x, data = agg, weights=freq)
)

Now we want to correct the number of degrees of freedom of the last model in our list. We do this by taking the sum of our freq column and subtracting the number of coefficients. We could also use nrow(ind) instead of the sum, since the two are identical:

# correct degrees of freedom
models[[4]]$df.residual <- sum(agg$freq) - length(coef(models[[4]]))
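The effect of this overwrite can be checked on a data set small enough to inspect by hand. This is a minimal sketch; the data and the names `agg_demo`, `ind_demo`, `fit_ind` and `fit_agg` are made up for illustration and are not part of the example above.

```r
# Made-up aggregated data: `freq` counts identical (x, y) rows
agg_demo <- data.frame(x    = c(0, 0, 1, 1, 2, 2),
                       y    = c(1, 2, 2, 4, 5, 7),
                       freq = c(4, 2, 3, 5, 1, 6))

# Expand back to one row per observation
ind_demo <- agg_demo[rep(seq_len(nrow(agg_demo)), agg_demo$freq), c("x", "y")]

fit_ind <- lm(y ~ x, data = ind_demo)
fit_agg <- lm(y ~ x, data = agg_demo, weights = freq)

# Same overwrite as above: total count minus number of coefficients
fit_agg$df.residual <- sum(agg_demo$freq) - length(coef(fit_agg))

se_ind <- coef(summary(fit_ind))[, "Std. Error"]
# summary() may warn that the stored df no longer matches the row count
se_agg <- suppressWarnings(coef(summary(fit_agg)))[, "Std. Error"]

print(cbind(individual = se_ind, aggregated = se_agg))
```

The standard errors should match because summary() divides the weighted residual sum of squares by the stored df.residual. Quantities that still depend on the number of rows, such as the reported number of observations or the adjusted R-squared, are not fixed by this overwrite and should be read with care.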

Finally, we summarize all four models using the modelsummary package. Notice that the first and last models have exactly the same coefficients and standard errors, even though the first was estimated using the full individual dataset and the last was estimated using the aggregated data:

modelsummary(models, fmt=5)

              True          Aggregated   Aggregated & W   Aggregated & W & DF
(Intercept)   1.08446       5.51391      1.08446          1.08446
              (0.00580)     (0.71710)    (0.22402)        (0.00580)
x             1.00898       0.91001      1.00898          1.00898
              (0.00558)     (0.30240)    (0.21564)        (0.00558)
Num.Obs.      1e+05         69           69               69
R2            0.246         0.119        0.246            0.246
R2 Adj.       0.246         0.106        0.235            0.999
AIC           405058.1      446.0        474.1            474.1
BIC           405086.7      452.7        480.8            480.8
Log.Lik.      -202526.074   -219.977     -234.046         -234.046
F             32676.664     9.056        21.894           32676.664
