具有包含NA的数据的聚类标准错误 [英] Clustered Standard Errors with data containing NAs

查看:201
本文介绍了具有包含NA的数据的聚类标准错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

基于此帖子. cl函数返回错误:

I'm unable to cluster standard errors using R and guidance based on this post. The cl function returns the error:

Error in tapply(x, cluster1, sum) : arguments must have same length

tapply上进行阅读之后,我仍然不确定为什么我的集群参数长度错误,以及导致此错误的原因.

After reading up on tapply I'm still not sure why my cluster argument is the wrong length, and what is causing this error.

这里是指向我正在使用的数据集的链接.

Here is a link to the data set that I'm using.

https://www .dropbox.com/s/y2od7um9pp4vn0s/Ec%201820%20-%20DD%20Data%20with%20Controls.csv

这是R代码:

# read in data
charter<-read.csv(file.choose())
View(charter)
colnames(charter)

# standardize NAEP scores
charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=T))/sd(charter$naep, na.rm=T)

# change NAs in year.passed column to 2014
charter$year.passed[is.na(charter$year.passed)]<-2014

# Add column with indicator for in treatment (passed legislation)
charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)

# fit model
charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter)
summary(charter.model)
# account for clustered standard errors by state
cl(dat=charter, fm=charter.model, cluster=charter$state)

# accounting for controls
charter.model.controls<-lm(naep~factor)

# clustered standard errors
# ---------

# function that calculates clustered standard errors
# source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/
cl   <- function(dat, fm, cluster){
  require(sandwich, quietly = TRUE)
  require(lmtest, quietly = TRUE)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- fm$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  print(K)
  uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
  vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
  coeftest(fm, vcovCL) 
}

# calculate clustered standard errors 
cl(charter, charter.model, charter$state)

该功能的内部工作有点麻烦.

The inner workings of the function are a little over my head.

推荐答案

在执行代码时,请注意线性模型中缺少观测值:

When you execute your code, notice that there are missing observations in the linear model:

> summary(charter.model)

Call:
lm(formula = naep ~ factor(year) + factor(state) + treatment, 
    data = charter)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2420  -1.6740  -0.2024   1.8345  12.3580 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 250.4983     1.2115 206.767  < 2e-16 ***
factor(year)1992              3.7970     0.7198   5.275 2.17e-07 ***
factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***

[..]

Residual standard error: 3.128 on 404 degrees of freedom
  (759 observations deleted due to missingness)
Multiple R-squared:  0.9337,    Adjusted R-squared:  0.9239 
F-statistic: 94.85 on 60 and 404 DF,  p-value: < 2.2e-16

这就是导致您看到的Error in tapply(x, cluster1, sum) : arguments must have same length错误消息的原因.

This is what is causing the Error in tapply(x, cluster1, sum) : arguments must have same length error message that you see.

cl(dat=charter, fm=charter.model, cluster=charter$state)中,聚类变量charter$state的长度应与回归估计中有效使用的观察数的长度完全相同(由于NA与原始数据帧中的行数不同)

In cl(dat=charter, fm=charter.model, cluster=charter$state) the cluster variable charter$state should have the exact same length as the number of observations effectively used in the regression estimation (which due to NAs is NOT the same as the number of rows in the original data frame).

要解决此问题,您可以执行以下操作.

To work around this you can do the following.

  1. 首先,您使用的是Arai函数的旧版本(cl)(请参阅 R 以同时引用旧版本或新版本,后者称为clx).

  1. First off you're using an older version of Arai's function (cl) (see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R for references to both the old or the new versions, the latter being called clx).

第二,我认为Arai对该函数的原始方法有些复杂,并且并没有真正遵循sandwichvcov*函数的标准接口.这就是为什么我附带了clx的略微修改版本的原因.我使代码更具可读性,并且界面更像您从sandwich vcov*函数所期望的那样:

Second I think Arai's original approach to this function is a bit convoluted, and doesn't really follow the standard interface of vcov* functions from sandwich. That's why I came with a slightly modified version of clx. I made the code a bit more readable, and the interface more like what you would expect from a sandwich vcov* function:

vcovCL <- function(x, cluster.by, type="sss", dfcw=1){
    # R-codes (www.r-project.org) for computing
    # clustered-standard errors. Mahmood Arai, Jan 26, 2008.

    # The arguments of the function are:
    # fitted model, cluster1 and cluster2
    # You need to install libraries `sandwich' and `lmtest'

    # reweighting the var-cov matrix for the within model
    require(sandwich)
    cluster <- cluster.by
    M <- length(unique(cluster))   
    N <- length(cluster)
    stopifnot(N == length(x$residuals))
    K <- x$rank
    ##only Stata small-sample correction supported right now 
    ##see plm >= 1.5-4
    stopifnot(type=="sss")  
    if(type=="sss"){
        dfc <- (M/(M-1))*((N-1)/(N-K))
    }
    uj  <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))
    mycov <- dfc * sandwich(x, meat=crossprod(uj)/N) * dfcw
    return(mycov)
}

如果在数据上尝试使用此功能,则会发现它解决了此特定问题:

If you try this function on the data you will see that it catches this specific issue:

> coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
 Error: N == length(x$residuals) is not TRUE

为避免该问题,您可以按照以下步骤操作:

To avoid the issue you could proceed as follows:

> charter.x <- na.omit(charter[ , c("state", 
                                  all.vars(formula(charter.model)))])
> coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state)) 

t test of coefficients:

                               Estimate  Std. Error     t value  Pr(>|t|)    
(Intercept)                  2.5050e+02  9.3781e-01  2.6711e+02 < 2.2e-16 ***
factor(year)1992             3.7970e+00  5.6019e-01  6.7780e+00 4.330e-11 ***
factor(year)1996             7.0436e+00  8.8574e-01  7.9522e+00 1.856e-14 ***
factor(year)2000             8.4313e+00  1.0906e+00  7.7311e+00 8.560e-14 ***
factor(year)2003             1.2392e+01  1.1670e+00  1.0619e+01 < 2.2e-16 ***
factor(year)2005             1.3490e+01  1.1747e+00  1.1484e+01 < 2.2e-16 ***
factor(year)2007             1.6334e+01  1.2469e+00  1.3100e+01 < 2.2e-16 ***
factor(year)2009             1.8118e+01  1.2556e+00  1.4430e+01 < 2.2e-16 ***
factor(year)2011             1.9110e+01  1.3459e+00  1.4199e+01 < 2.2e-16 ***
factor(year)2013             1.9301e+01  1.4896e+00  1.2957e+01 < 2.2e-16 ***
factor(state)Alaska          1.4178e+01  8.7686e-01  1.6169e+01 < 2.2e-16 ***
factor(state)Arizona         8.6313e+00  8.1439e-01  1.0598e+01 < 2.2e-16 ***
factor(state)Arkansas        4.3313e+00  8.1439e-01  5.3185e+00 1.736e-07 ***
factor(state)California      3.1103e+00  9.1619e-01  3.3948e+00 0.0007549 ***
factor(state)Colorado        1.7939e+01  7.9736e-01  2.2498e+01 < 2.2e-16 ***
factor(state)Connecticut     1.8031e+01  8.1439e-01  2.2141e+01 < 2.2e-16 ***
factor(state)D.C.           -1.8369e+01  8.1439e-01 -2.2555e+01 < 2.2e-16 ***
factor(state)Delaware        1.2050e+01  7.9736e-01  1.5113e+01 < 2.2e-16 ***
factor(state)Florida         7.3838e+00  7.9736e-01  9.2602e+00 < 2.2e-16 ***
factor(state)Georgia         6.4313e+00  8.1439e-01  7.8971e+00 2.724e-14 ***
factor(state)Hawaii          3.3313e+00  8.1439e-01  4.0906e+00 5.196e-05 ***
factor(state)Idaho           1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Illinois        1.2670e+01  8.2224e-01  1.5409e+01 < 2.2e-16 ***
factor(state)Indianna        1.7174e+01  6.1079e-01  2.8117e+01 < 2.2e-16 ***
factor(state)Iowa            2.0074e+01  6.8460e-01  2.9322e+01 < 2.2e-16 ***
factor(state)Kansas          2.0123e+01  8.6796e-01  2.3184e+01 < 2.2e-16 ***
factor(state)Kentucky        1.0200e+01  4.1999e-14  2.4287e+14 < 2.2e-16 ***
factor(state)Louisiana      -1.6866e-01  8.1439e-01 -2.0710e-01 0.8360322    
factor(state)Maine           2.0231e+01  1.7564e-01  1.1518e+02 < 2.2e-16 ***
factor(state)Maryland        1.4274e+01  6.1079e-01  2.3369e+01 < 2.2e-16 ***
factor(state)Massachusetts   2.4868e+01  8.3960e-01  2.9619e+01 < 2.2e-16 ***
factor(state)Michigan        1.2031e+01  8.1439e-01  1.4773e+01 < 2.2e-16 ***
factor(state)Minnesota       2.5110e+01  9.1619e-01  2.7407e+01 < 2.2e-16 ***
factor(state)Mississippi    -3.5470e+00  1.7564e-01 -2.0195e+01 < 2.2e-16 ***
factor(state)Missouri        1.3447e+01  7.2706e-01  1.8495e+01 < 2.2e-16 ***
factor(state)Montana         2.2512e+01  8.4814e-01  2.6543e+01 < 2.2e-16 ***
factor(state)Nebraska        1.9600e+01  4.3105e-14  4.5471e+14 < 2.2e-16 ***
factor(state)Nevada          4.9800e+00  8.6796e-01  5.7375e+00 1.887e-08 ***
factor(state)New Hampshire   2.2026e+01  7.6338e-01  2.8853e+01 < 2.2e-16 ***
factor(state)New Jersey      2.0651e+01  7.6338e-01  2.7052e+01 < 2.2e-16 ***
factor(state)New Mexico      1.5313e+00  8.1439e-01  1.8803e+00 0.0607809 .  
factor(state)New York        1.2152e+01  7.1259e-01  1.7054e+01 < 2.2e-16 ***
factor(state)North Carolina  1.2231e+01  8.1439e-01  1.5019e+01 < 2.2e-16 ***
factor(state)North Dakota    2.4278e+01  1.0420e-01  2.3299e+02 < 2.2e-16 ***
factor(state)Ohio            1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Oklahoma        8.4518e+00  7.8321e-01  1.0791e+01 < 2.2e-16 ***
factor(state)Oregon          1.6535e+01  7.3538e-01  2.2486e+01 < 2.2e-16 ***
factor(state)Pennsylvania    1.6651e+01  7.6338e-01  2.1812e+01 < 2.2e-16 ***
factor(state)Rhode Island    9.5313e+00  8.1439e-01  1.1704e+01 < 2.2e-16 ***
factor(state)South Carolina  9.5346e+00  8.3960e-01  1.1356e+01 < 2.2e-16 ***
factor(state)South Dakota    2.1211e+01  3.5103e-01  6.0425e+01 < 2.2e-16 ***
factor(state)Tennessee       4.9148e+00  6.1473e-01  7.9951e+00 1.375e-14 ***
factor(state)Texas           1.4231e+01  8.1439e-01  1.7475e+01 < 2.2e-16 ***
factor(state)Utah            1.5114e+01  7.2706e-01  2.0787e+01 < 2.2e-16 ***
factor(state)Vermont         2.3474e+01  2.0299e-01  1.1564e+02 < 2.2e-16 ***
factor(state)Virginia        1.6252e+01  7.1259e-01  2.2807e+01 < 2.2e-16 ***
factor(state)Washington      1.9073e+01  1.8183e-01  1.0489e+02 < 2.2e-16 ***
factor(state)West Virginia   5.0000e+00  4.2022e-14  1.1899e+14 < 2.2e-16 ***
factor(state)Wisconsin       1.9994e+01  8.2447e-01  2.4251e+01 < 2.2e-16 ***
factor(state)Wyoming         1.8231e+01  8.1439e-01  2.2386e+01 < 2.2e-16 ***
treatment                    1.2108e+00  1.0180e+00  1.1894e+00 0.2349682    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

这不好,但是可以完成工作.现在cl也可以正常工作,并产生与上述相同的结果:

It's not nice, but gets the job done. Now cl will also work just fine and yield the same result as above:

cl(dat=charter, fm=charter.model, cluster=charter.x$state)


解决此问题的更好方法是使用 multiwayvcov 软件包.根据软件包的网站,它是对Arai的代码的改进:


A better way to go about this is to use the multiwayvcov package. As per the packages's website, it is an improvement upon Arai's code:

由于缺失而导致对观察结果的透明处理

Transparent handling of observations dropped due to missingness

将Petersen数据与模拟的NA和cluster.vcov()一起使用:

Using the Petersen data with simulated NAs and cluster.vcov():

library("lmtest")
library("multiwayvcov")

data(petersen)
set.seed(123)
petersen[ sample(1:5000, 15), 3] <- NA

m1 <- lm(y ~ x, data = petersen)
summary(m1)
## 
## Call:
## lm(formula = y ~ x, data = petersen)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.759 -1.371 -0.018  1.340  8.680 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02793    0.02842   0.983    0.326    
## x            1.03635    0.02865  36.175   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.007 on 4983 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.208,  Adjusted R-squared:  0.2078 
## F-statistic:  1309 on 1 and 4983 DF,  p-value: < 2.2e-16

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.027932   0.067198  0.4157   0.6777    
## x           1.036354   0.050700 20.4407   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


有关使用plm软件包的其他方法,请参见:


For a different approach using the plm package see:

这篇关于具有包含NA的数据的聚类标准错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆