Why do attempts to filter/subset a raked survey design object fail?


Problem description

I'm trying to filter rows in a survey design object to exclude a particular subset of data. In the example below, which consists of survey data from several schools, I'm trying to exclude data from schools in Alameda County, California.

Surprisingly, when the survey design object includes weights created by raking, attempts to filter or subset the data appear to fail. I think this is a bug, but I'm not sure. Why does the presence of raked weights change the result of filtering or subsetting the data?

library(survey)

data(api)

# Declare basic clustered design ----
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)

# Add raking weights for school type ----
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
pop.schwide <- data.frame(sch.wide=c("No","Yes"), Freq=c(1072,5122))

raked_design <- rake(cluster_design,
                     sample.margins = list(~stype,~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

# Filter the two different design objects ----
subset_from_raked_design <- subset(raked_design, cname != "Alameda")

subset_from_cluster_design <- subset(cluster_design, cname != "Alameda")

# Count the number of rows in each subset ----
# Note that, surprisingly, they differ
nrow(subset_from_raked_design)
#> [1] 183
nrow(subset_from_cluster_design)
#> [1] 172

This issue occurs no matter how you attempt to subset the data. For example, here's what happens when you use row indexing to keep only the first 10 rows:

nrow(cluster_design[1:10, ])
#> [1] 10
nrow(raked_design[1:10, ])
#> [1] 183

Recommended answer

This behavior occurs because the survey package is trying to help you avoid making a statistical mistake.

For especially complex designs involving calibration, post-stratification, or raking, estimates for sub-populations can't simply be computed by filtering away data from outside the sub-population of interest; that approach produces misleading standard errors and confidence intervals.

So to keep you from running into this statistical issue, the survey package doesn't let you completely remove records outside of your subset of interest. Instead, it essentially takes note of which rows you want to ignore and then sets their weights to effectively zero.
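As a rough illustration of what this means in practice, here is a minimal sketch that rebuilds the raked design from the question, subsets the design object, and estimates a domain mean. The names `non_alameda`, `domain_mean`, `keep`, and `manual_mean` are made up for this example. The manual check assumes that the design keeps its rows in the same order as `apiclus1`.

```r
library(survey)
data(api)

# Rebuild the raked design from the question
cluster_design <- svydesign(data = apiclus1, id = ~dnum,
                            weights = ~pw, fpc = ~fpc)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))
raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

# Subset the design object itself (not the data frame) to define the domain
non_alameda <- subset(raked_design, cname != "Alameda")

# Domain estimate of mean api00, with a design-based standard error
domain_mean <- svymean(~api00, non_alameda)
print(domain_mean)

# Manual check of the point estimate only: raked weights, kept rows only
# (assumes the design rows are in the same order as apiclus1)
keep <- apiclus1$cname != "Alameda"
manual_mean <- weighted.mean(apiclus1$api00[keep],
                             weights(raked_design)[keep])
print(manual_mean)
```

The two point estimates should agree, because rows outside the subset contribute a weight of zero. Only the standard error depends on the full design being retained.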

In the example from this question, the rows that were meant to be filtered away have a value of Inf in subset_from_raked_design$prob. Because a row's weight is the reciprocal of its sampling probability, this effectively means those rows are assigned a weight of zero.

subset_from_raked_design$prob[1:12]
#> Inf Inf Inf Inf Inf Inf
#> Inf Inf Inf Inf Inf 
#> 0.01986881 ....

raked_design$prob[1:12]
#> 0.01986881 0.03347789 0.03347789 0.03347789 0.03347789 0.03347789
#> 0.03347789 0.03347789 0.03347789 0.02717969 0.02717969
#> 0.01986881 ....
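If you need the number of rows that actually remain in the subset, one option is to count the rows with a finite sampling probability instead of calling nrow(). The following is a minimal sketch along those lines; the names `n_kept_by_prob` and `n_kept_by_weight` are made up for this example.

```r
library(survey)
data(api)

# Rebuild the raked design and the subset from the question
cluster_design <- svydesign(data = apiclus1, id = ~dnum,
                            weights = ~pw, fpc = ~fpc)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))
raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))
subset_from_raked_design <- subset(raked_design, cname != "Alameda")

# nrow() still reports every row of the original data
nrow(subset_from_raked_design)

# Rows that remain in the subset have a finite sampling probability
n_kept_by_prob <- sum(is.finite(subset_from_raked_design$prob))
print(n_kept_by_prob)

# Equivalently, count the rows whose weight is nonzero
n_kept_by_weight <- sum(weights(subset_from_raked_design) > 0)
print(n_kept_by_weight)
```

Both counts should match the number of rows reported for subset_from_cluster_design in the question.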
