Stratified sampling with restrictions: fixed total size evenly partitioned among groups

Problem description

I have some grouped data with one row per item. I want to do a stratified sampling by group, with two restrictions: (1) a certain total sample size; (2) samples should be partitioned as evenly as possible among groups (i.e. minimal sd of the group sample sizes).

Ideally, we pick the same (fixed) number of items from each group, which is no problem when the group size is >= the desired size for all groups. However, sometimes a group's size is less than size. The total number of items is always above the total sample size, though. For example, with a total sample size of 12 and four distinct groups, we would ideally pick 3 items from each group:

size_tot <- 12
n_grp <- 4
size <- size_tot / n_grp

Some data:

library(data.table)

d2 <- data.table(id = 1:16,
                 grp = rep(c("a", "b", "c", "d"), c(9, 4, 2, 1)))
d2
#     id grp
#  1:  1   a
#  2:  2   a
#  3:  3   a
#  4:  4   a
#  5:  5   a
#  6:  6   a
#  7:  7   a
#  8:  8   a
#  9:  9   a
# 10: 10   b
# 11: 11   b
# 12: 12   b
# 13: 13   b
# 14: 14   c
# 15: 15   c
# 16: 16   d

My original logic was "if the number of items is equal to or larger than size, sample size items from the group, else just pick all items from the group". See also here, here and here.

set.seed(1)
d2[ , if(.N >= size) .SD[sample(x = .N, size = size)] else .SD, by = "grp"]

#    grp id
# 1:   a  3
# 2:   a  9
# 3:   a  5
# 4:   b 13
# 5:   b 10
# 6:   b 11
# 7:   c 14
# 8:   c 15
# 9:   d 16

In the two groups with sufficient number of items (a and b), we sampled 3 items from each. In the small groups (c and d), we just picked all there was, i.e. 2 and 1 respectively. This results in a total sample size of 9, i.e. less than the desired total size of 12. Thus, we need to sample additional items from larger groups with a surplus of items to achieve the desired total sample size. In this case, the desired sampling would be 1 additional item from "b" and two additional items from "a".
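The shortfall described above can be checked with a few lines of base R. This is only a sketch: the names n, naive, shortfall and surplus are made up for illustration, and the group sizes are typed in by hand rather than computed from d2.

```r
# Group sizes copied by hand from d2 (illustrative names, not from the question)
n <- c(a = 9, b = 4, c = 2, d = 1)
size_tot <- 12
size <- size_tot / length(n)        # 3 per group in the ideal case

naive <- pmin(n, size)              # take `size` items, or the whole group if it is smaller
shortfall <- size_tot - sum(naive)  # items still missing from the desired total
surplus <- n - naive                # unsampled items left in each group

naive
# a b c d
# 3 3 2 1
shortfall
# [1] 3
surplus
# a b c d
# 6 1 0 0
```

Only "a" and "b" have any surplus, and "b" has just one spare item, so the three missing items have to come from those two groups.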

Here's how I thought of partitions with lowest sd. The total sample size can be partitioned into four groups like this:

library(partitions)
cmp <- compositions(n = size_tot, m = 4)

The partitions can then be ordered from low sd (equal sample size among groups - desired) to high sd:

std <- apply(cmp, 2, sd)
cmp2 <- cmp[ , order(std)]

cmp2[ , 1:10]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    3    4    3    3    4    3    4    2    3     2
# [2,]    3    3    4    3    3    4    2    4    2     3
# [3,]    3    3    3    4    2    2    3    3    4     4
# [4,]    3    2    2    2    3    3    3    3    3     3

And the group sizes:

d2[ , .(n = .N), by = "grp"]
#    grp n
# 1:   a 9
# 2:   b 4
# 3:   c 2
# 4:   d 1

But how do I match this partition (which sums to 12) against the group sample sizes (which do not necessarily sum to 12)? Does anyone else smell an XY problem here? If so, are there alternative approaches that I have overlooked?

PS: I have considered proportional allocation (proportionate sampling), but when the distribution of group sizes is sufficiently skewed, such sampling obviously does not respect the absolute total sample size and does not distribute samples evenly among groups (e.g. caret::createDataPartition and strata::balancedstratification).
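To make the PS concrete, here is a rough base R illustration of what plain proportional allocation would give for the example data. It does not use either of the packages mentioned; the names n and prop are illustrative, and rounding to the nearest integer is just one assumed way to turn the shares into counts.

```r
# Group sizes copied by hand from d2 (illustrative names)
n <- c(a = 9, b = 4, c = 2, d = 1)
size_tot <- 12

# Share of the total sample in proportion to group size
prop <- size_tot * n / sum(n)
prop
#    a    b    c    d
# 6.75 3.00 1.50 0.75

# Rounding to whole items (an assumption) overshoots the requested total
sum(round(prop))
# [1] 13
```

The shares are far from even (about 7 items from "a" against 1 from "d"), and the rounded counts no longer add up to 12.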

Recommended answer

I think your answer is almost there. You just need to filter on cmp2 to get the first sampling set that meets the criteria that the sampling sizes are lower or equal to the group sizes:

#Create a set of indices of sampling sizes that fit the criteria
original_groups <- d2[, .N, by = grp][,N]
valid_indexes <- apply(cmp2, 2, function(x) all(x <= original_groups))

#Take the first of these valid indices (lowest variance)
sampling_sizes <- cmp2[,which(valid_indexes)[1]]

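#Create a sampling size variable on the data.table
#(this relies on the rows of d2 being ordered by grp, as they are here)
d2[, sampling_size := rep(sampling_sizes, original_groups)]

#Sample as before; sampling_size is constant within a group, so use its first value
d2[ , .SD[sample(x = .N, size = sampling_size[1L])], by = "grp"]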
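As a possible alternative to enumerating every composition, the group sample sizes can also be built up round by round: in each round, every group that still has unsampled items receives an equal share of what is left, capped at its own size. The sketch below is not part of the original answer; allocate_even is a hypothetical helper written in base R, and it assumes the requested total does not exceed the total number of items.

```r
# Hypothetical helper (not from the original answer): spread `total` as evenly
# as possible over groups of sizes `n`, never exceeding any group's size.
allocate_even <- function(n, total) {
  stopifnot(total <= sum(n))
  alloc <- numeric(length(n))
  names(alloc) <- names(n)
  remaining <- total
  while (remaining > 0) {
    open <- which(alloc < n)             # groups that still have unsampled items
    share <- remaining %/% length(open)  # equal share for each open group
    if (share == 0) {
      # Fewer items left than open groups: give one item each to the open
      # groups with the most spare capacity (an arbitrary tie-break)
      spare <- n[open] - alloc[open]
      pick <- open[order(spare, decreasing = TRUE)][seq_len(remaining)]
      alloc[pick] <- alloc[pick] + 1
      remaining <- 0
    } else {
      add <- pmin(share, n[open] - alloc[open])  # cap at what each group has left
      alloc[open] <- alloc[open] + add
      remaining <- remaining - sum(add)
    }
  }
  alloc
}

res <- allocate_even(c(a = 9, b = 4, c = 2, d = 1), total = 12)
res
# a b c d
# 5 4 2 1
```

For the example data this gives 5, 4, 2 and 1, which matches the allocation the question asks for (two extra items from "a" and one from "b"). Under the assumptions above, something like sampling_sizes <- allocate_even(original_groups, size_tot) could stand in for the cmp2 lookup, which may matter when the number of groups makes the full set of compositions large.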