Stratified sample when some strata are too small

Question

I need to draw a stratified sample with n observations in each stratum, but some strata have fewer than n observations. If a stratum has too few observations (say, k < n), I want to sample all k observations from that stratum.

library(sampling)

n <- 10
geo_ID <- c(rep(1, times = 20), rep(2, times = 20), rep(c(1, 2, 3, 4), times = 5))
set.seed(42)
V1 <- rnorm(60, 0, 1)
V2 <- rnorm(60, 2, 1)

DF <- data.frame(geo_ID = geo_ID, V1 = V1, V2 = V2)
# Sort by stratum, as explained in the ?strata help file
DF <- DF[order(DF[, "geo_ID"]), ]

strata(DF, stratanames = "geo_ID", size = c(n, n, n, n), method = "srswor")

If I use sampling without replacement as above, I (understandably) get the error:

Error in strata(DF, stratanames = "geo_ID", size = c(10, 10, 10, 10),  : 
  not enough obervations in the stratum 

Sampling with replacement (method = "srswr") avoids the error, but that is not ideal: it sometimes draws repeated observations even in strata that are large enough to supply n unique observations.

NOTE: There's a similar question on SO ("Stratified sampling - not enough observations"), but it wasn't really answered, and I think this question is more general. The answers to the linked question are not generally useful, since they require either (i) sample sizes proportional to the stratum size (whereas I need a fixed number) or (ii) manually programming stratum by stratum, which isn't feasible as the number of strata increases.

Recommended answer

This doesn't answer your question about how to do this with the "sampling" package, but I've written a function called stratified that will do this for you.

If you have "devtools" installed, you can load it like this:

library(devtools)
source_gist(6424112)

Otherwise, just copy the code of the function from the Gist into your session and have fun.

Usage is simple:

set.seed(1) ## So you can reproduce this
stratified(DF, group = "geo_ID", size = 10)
# Some groups
# ---3, 4---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
#    geo_ID          V1        V2
# 7       1  1.51152200 2.3358481
# 9       1  2.01842371 2.9207286
# 14      1 -0.27878877 1.0464766
# 20      1  1.32011335 0.9002191
# 5       1  0.40426832 1.2727079
# :::SNIP:::
# 43      3  0.75816324 0.9967914
# 47      3 -0.81139318 1.5777441
# 55      3  0.08976065 0.3389009
# 51      3  0.32192527 1.9749074
# 48      4  1.44410126 1.8776498
# 44      4 -0.72670483 3.8484819
# 60      4  0.28488295 2.1372562
# 52      4 -0.78383894 2.1080727
# 56      4  0.27655075 1.6176663
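
If loading the Gist is not an option, the core behaviour described above (take n rows per stratum, or every row when a stratum has fewer than n) can be sketched in base R. This is only an illustration, not the code of stratified; the helper name sample_per_stratum is hypothetical, and it returns the sampled rows without the informational message shown above.

```r
# Hypothetical helper, not part of the "sampling" package or the Gist.
sample_per_stratum <- function(df, group, n) {
  # Row indices of df, split by stratum
  idx_by_group <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(idx_by_group, function(idx) {
    # Keep every row when the stratum has n or fewer rows;
    # otherwise draw n rows without replacement
    if (length(idx) <= n) idx else idx[sample.int(length(idx), n)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Same layout as the question's data: strata 1 and 2 have 25 rows, 3 and 4 have 5
set.seed(42)
DF <- data.frame(
  geo_ID = c(rep(1, 20), rep(2, 20), rep(1:4, times = 5)),
  V1 = rnorm(60, 0, 1),
  V2 = rnorm(60, 2, 1)
)

out <- sample_per_stratum(DF, group = "geo_ID", n = 10)
table(out$geo_ID)
```

Indexing with sample.int() rather than calling sample() on the index vector avoids the surprise that sample(x, ...) treats a single number x as the range 1:x.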


There are some "fun" features, like subsetting your strata in the function itself:

## Selects only "geo_ID" values equal to 1 or 4
stratified(DF, group = "geo_ID", size = 10, select = list(geo_ID = c(1, 4)))

... sampling proportionally:

## Just set the size argument to a value less than 1
stratified(DF, group = "geo_ID", size = .1)

... and using multiple columns as your groups. The comments at the Gist include some examples to try out.
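
For comparison, proportional sampling per stratum can also be sketched in base R. The helper name sample_fraction_per_stratum is hypothetical, and the rounding rule (ceiling, so that no stratum ends up empty) is an assumption made for this sketch; the Gist's function may round differently.

```r
# Hypothetical helper: sample a fixed fraction of each stratum.
sample_fraction_per_stratum <- function(df, group, frac) {
  stopifnot(frac > 0, frac <= 1)
  # Row indices of df, split by stratum
  idx_by_group <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(idx_by_group, function(idx) {
    # ceiling() is an assumption: it guarantees at least one row per stratum
    k <- ceiling(frac * length(idx))
    idx[sample.int(length(idx), k)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
DF <- data.frame(
  geo_ID = c(rep(1, 20), rep(2, 20), rep(1:4, times = 5)),
  V1 = rnorm(60)
)

out <- sample_fraction_per_stratum(DF, group = "geo_ID", frac = 0.1)
table(out$geo_ID)
```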
