Sampling a specific age distribution from a dataset

Question

Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.

How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? For example, I want the 1,000 people distributed like this:

0.3 * Age 0 - 30

0.3 * Age 31 - 50

0.2 * Age 51 - 69

0.2 * Age 70 - 100

Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?

As an example, this is how I have created the population distribution of Maine:

set.seed(123)
library(magrittr)

# Age brackets (min, max) and the proportion of the population in each
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Draw 1000 bracket indices according to the proportions
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)

# Draw an age uniformly within each sampled bracket
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()

names(Maine) <- c("Age")

Now I don't know how to bring this together with my other dataset which has the whole US population... I'd appreciate any help, I am stuck for quite a while now...

Recommended answer

Below are four different approaches. Two use functions from, respectively, the splitstackshape and sampling packages, one uses base mapply, and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).

First let's set up some fake data and sampling parameters:

# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))

# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)

# Total number of people sampled
n = 1000

# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
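
Before sampling, it can help to confirm that the grouping and the per-group counts look as expected. The sketch below rebuilds the fake data on a smaller scale (1e4 rows is an arbitrary choice to keep it quick) and checks that probs*n gives whole numbers that add up to n. The name sizes is introduced here only for illustration.

```r
# Smaller version of the fake data above (1e4 rows is an assumption, for speed)
set.seed(156)
df = data.frame(age=sample(0:100, 1e4, replace=TRUE))
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)

n = 1000
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

# Rows available in each age group
table(df$age.groups)

# Rows we intend to draw from each age group
sizes = probs * n
sizes

# The sizes should be whole numbers that sum to n
stopifnot(isTRUE(all.equal(sum(sizes), n)))
stopifnot(isTRUE(all.equal(unname(sizes), round(unname(sizes)))))

# Each group must hold at least as many rows as we want to draw from it
stopifnot(all(table(df$age.groups)[names(sizes)] >= sizes))
```

If any of these checks fail, sampling without replacement from that group would either error or return a total different from n.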

Using the above sampling parameters, we want to sample n total values with a proportion probs from each age group.

Option 1: mapply

mapply can apply multiple arguments to a function. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:

df.sample = mapply(a=split(df, df$age.groups), b=probs*n, 
       function(a,b) {
         a[sample(1:nrow(a), b), ]
       }, SIMPLIFY=FALSE)

mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:

df.sample = do.call(rbind, df.sample)
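
Option 1 can be wrapped in a small helper so the same logic is reusable with other grouping columns. This is a sketch, not part of the original answer: the function name stratified_sample and its arguments are hypothetical, and rounding probs*n to whole numbers is an assumption that may make the total differ slightly from n when the proportions don't divide evenly.

```r
# Hypothetical helper wrapping the mapply approach from Option 1.
# data:  data frame to sample from
# group: name of the (factor) column that defines the strata
# probs: named vector of proportions, names matching the factor levels
# n:     total number of rows wanted
stratified_sample = function(data, group, probs, n) {
  pieces = split(data, data[[group]])
  stopifnot(all(names(pieces) %in% names(probs)))
  # Assumption: round to whole rows; the total can drift from n by rounding
  sizes = round(probs[names(pieces)] * n)
  available = sapply(pieces, nrow)
  if (any(sizes > available)) {
    stop("some strata have fewer rows than requested")
  }
  # Sample the requested number of rows within each stratum
  out = mapply(function(a, b) a[sample(seq_len(nrow(a)), b), , drop=FALSE],
               pieces, sizes, SIMPLIFY=FALSE)
  do.call(rbind, out)
}

# Usage on a small fake data set (1e4 rows, chosen only for speed)
set.seed(156)
df = data.frame(age=sample(0:100, 1e4, replace=TRUE))
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

df.sample = stratified_sample(df, "age.groups", probs, 1000)
table(df.sample$age.groups)
```

The explicit size check is there because sample() without replacement fails with a less helpful message when a stratum is too small.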

Check the sampling:

table(df.sample$age.groups)




[0,30)  [30,51)  [51,70) [70,Inf) 
   300      300      200      200
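
The same idea carries over to the age brackets in the question. As a sketch (not from the original answer), the bracket table popMaine from the question can supply both the cut breaks and the proportions. The proportions there sum to 1.02, so rescaling them to sum to 1 and rounding the counts are assumptions made here; us, sizes, and maine.sample are illustrative names, and the fake us data stands in for the real US dataset.

```r
# Bracket table from the question
popMaine = data.frame(min=c(0, 19, 26, 35, 55, 65),
                      max=c(18, 25, 34, 54, 64, 113),
                      prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Stand-in for the US data set (fake ages; 1e5 rows chosen for speed)
set.seed(123)
us = data.frame(age=sample(0:100, 1e5, replace=TRUE))

# Lower bounds as breaks, closed on the left: [0,19), [19,26), ...
us$age.groups = cut(us$age, c(popMaine$min, Inf), right=FALSE)

# Assumption: rescale the proportions to sum to 1, then round to whole rows
n = 1000
p = popMaine$prop / sum(popMaine$prop)
sizes = setNames(round(p * n), levels(us$age.groups))

# Same mapply step as in Option 1
maine.sample = mapply(function(a, b) a[sample(1:nrow(a), b), ],
                      split(us, us$age.groups), sizes, SIMPLIFY=FALSE)
maine.sample = do.call(rbind, maine.sample)

table(maine.sample$age.groups)
sum(sizes)  # can differ from n by a row or two because of rounding
```

Because the rows come from the full data frame, the sampled people keep their other columns (race, gender, and so on), which is what joining a separately generated age vector would not give you.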




Option 2: stratified function from the splitstackshape package

The size argument requires a named vector with the number of samples from each stratum.

library(splitstackshape)

df.sample2 = stratified(df, "age.groups", size=probs*n)



Option 3: strata function from the sampling package

This option is by far the slowest.

library(sampling)

# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]

sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")

df.sample3 = df[sampled.rows$ID_unit, ] 



Option 4: tidyverse packages

map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.

library(dplyr)
library(purrr)

df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows



Timings

library(microbenchmark)




Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10  a 
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10  a 
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10  a
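
The call that produced this table is not shown. As a rough, dependency-free sketch, one of the approaches can be timed with base R's system.time instead of microbenchmark. The wrapper name run_mapply and the 10 repetitions are assumptions, and the numbers will vary by machine and will not match the table.

```r
# Same fake data and sampling parameters as in the setup above
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
n = 1000
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

# Option 1 wrapped in a function so it can be timed repeatedly
run_mapply = function() {
  s = mapply(function(a, b) a[sample(1:nrow(a), b), ],
             split(df, df$age.groups), probs*n, SIMPLIFY=FALSE)
  do.call(rbind, s)
}

# Elapsed seconds for each of 10 runs
elapsed = replicate(10, system.time(run_mapply())[["elapsed"]])

# Summary in milliseconds, comparable in spirit to the table above
summary(elapsed * 1000)
```

The other options can be timed the same way by wrapping each in its own function.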

