Stratified random sampling from a data frame
Question
I have a data frame in the following format:
head(subset)
# ants 0 1 1 0 1
# age 1 2 2 1 3
# lc 1 1 0 1 0
I need to create a new data frame of random samples stratified by age and lc. For example, I want 30 samples from age:1 and lc:1, 30 samples from age:1 and lc:0, and so on.
I did look at random sampling methods such as:
newdata <- function(subset, age, 30)
But that is not the code I want.
Recommended answer
I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:
## Sample data
library(data.table)  # data.table() is used below, so the package must be attached
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, replace = TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))
# table(d$age, d$lc)
For stratified, you specify the dataset, the stratifying columns, and either an integer giving the number of rows you want from each group or a decimal giving the fraction you want returned (for example, .1 means 10% from each group).
library(splitstackshape)
set.seed(1)
out <- stratified(d, c("age", "lc"), 30)
head(out)
# age lc ants
# 1: 1 0 1
# 2: 1 0 0
# 3: 1 0 1
# 4: 1 0 1
# 5: 1 0 0
# 6: 1 0 1
table(out$age, out$lc)
#
# 0 1
# 1 30 30
# 2 30 30
# 3 30 30
# 4 30 30
# 5 30 30
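If neither package is available, the same idea can be sketched in base R: split the row indices by the stratifying columns, then draw from each group. This is only an illustration and not how stratified is implemented. The helper name stratified_base and the data frame d_base are made up for this example, it assumes a plain data.frame rather than a data.table, and rounding the per-group count for fractional sizes is a choice made here.

```r
# Hypothetical helper (not part of any package): sample rows within each
# combination of the columns named in `group_cols`.
stratified_base <- function(df, group_cols, size) {
  # One vector of row indices per non-empty group combination
  idx_by_group <- split(seq_len(nrow(df)), df[group_cols], drop = TRUE)
  picked <- lapply(idx_by_group, function(idx) {
    # size < 1 is read as a fraction of the group, otherwise as a row count
    k <- if (size < 1) round(size * length(idx)) else size
    if (k > length(idx)) stop("a group has fewer rows than requested")
    # Index into idx instead of calling sample(idx, k), because
    # sample(x, k) treats a length-one numeric x as 1:x
    idx[sample.int(length(idx), k)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
n <- 1e4
d_base <- data.frame(age  = sample(1:5, n, replace = TRUE),
                     lc   = rbinom(n, 1, 0.5),
                     ants = rbinom(n, 1, 0.7))

out_base <- stratified_base(d_base, c("age", "lc"), 30)   # 30 rows per group
out_frac <- stratified_base(d_base, c("age", "lc"), 0.1)  # about 10% per group
table(out_base$age, out_base$lc)
```

With the fixed size of 30, every age/lc cell of the resulting table should contain exactly 30 rows, matching the stratified output above.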
For sample_n, you first create a grouped table (using group_by) and then specify the number of observations you want. If you want proportional sampling instead, use sample_frac.
library(dplyr)
set.seed(1)
out2 <- d %>%
  group_by(age, lc) %>%
  sample_n(30)
# table(out2$age, out2$lc)
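In dplyr 1.0.0 and later, sample_n and sample_frac are superseded by slice_sample, which takes n for a fixed count and prop for a fraction. A rough equivalent of the code above, assuming dplyr 1.0.0 or newer is installed, might look like this (the names d2, out_n and out_prop are chosen here for illustration):

```r
library(dplyr)

set.seed(1)
n <- 1e4
d2 <- data.frame(age  = sample(1:5, n, replace = TRUE),
                 lc   = rbinom(n, 1, 0.5),
                 ants = rbinom(n, 1, 0.7))

# Fixed size: 30 rows from every age/lc combination
out_n <- d2 %>%
  group_by(age, lc) %>%
  slice_sample(n = 30) %>%
  ungroup()

# Proportional: about 10% of the rows in every age/lc combination
out_prop <- d2 %>%
  group_by(age, lc) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

table(out_n$age, out_n$lc)
```

The ungroup() call at the end is optional; it just returns an ordinary ungrouped table so that later operations are not applied per group by accident.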