Stratified random sampling from data frame
Question
I have a data frame in the format:
head(subset)
# ants 0 1 1 0 1
# age 1 2 2 1 3
# lc 1 1 0 1 0
I need to create a new data frame with random samples drawn according to age and lc. For example, I want 30 samples from age:1 and lc:1, 30 samples from age:1 and lc:0, etc.
I did look at a random sampling method like:
newdata <- function(subset, age, 30)
But this is not the code that I want.
Recommended answer
I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:
## Sample data
library(data.table)  # provides data.table(), used below
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, replace = TRUE),
                lc = rbinom(n, 1, 0.5),
                ants = rbinom(n, 1, 0.7))
# table(d$age, d$lc)
For stratified, you specify the dataset, the stratifying columns, and either an integer giving the number of rows you want from each group or a decimal giving the fraction you want returned (for example, .1 returns 10% of each group).
library(splitstackshape)
set.seed(1)
out <- stratified(d, c("age", "lc"), 30)
head(out)
# age lc ants
# 1: 1 0 1
# 2: 1 0 0
# 3: 1 0 1
# 4: 1 0 1
# 5: 1 0 0
# 6: 1 0 1
table(out$age, out$lc)
#
# 0 1
# 1 30 30
# 2 30 30
# 3 30 30
# 4 30 30
# 5 30 30
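The call above uses an integer size. As a minimal sketch of the fractional form described earlier, the example below passes .1 instead, assuming the same kind of sample data as in the answer. The name out_frac is only illustrative, and the exact rounding of each group's size is left to the package, so the total is only checked approximately.

```r
library(data.table)
library(splitstackshape)

# Same style of sample data as in the answer
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, replace = TRUE),
                lc = rbinom(n, 1, 0.5),
                ants = rbinom(n, 1, 0.7))

# A size below 1 is treated as a fraction: about 10% of each age/lc group
out_frac <- stratified(d, c("age", "lc"), 0.1)

# Compare the sampled counts per group with the original counts
table(out_frac$age, out_frac$lc)
table(d$age, d$lc)
```

Each cell of the first table should be roughly one tenth of the matching cell in the second.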
For sample_n, you first create a grouped table (using group_by) and then specify the number of observations you want. If you want proportional sampling instead, use sample_frac.
library(dplyr)
set.seed(1)
out2 <- d %>%
  group_by(age, lc) %>%
  sample_n(30)
# table(out2$age, out2$lc)
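The proportional variant mentioned above can be sketched the same way. The example below is self-contained and uses a plain data.frame. The second half assumes dplyr 1.0.0 or later, where slice_sample() supersedes sample_n() and sample_frac(); this is not part of the original answer, and the names out_frac and out_slice are only illustrative.

```r
library(dplyr)

# Same style of sample data as in the answer, as a plain data.frame
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, replace = TRUE),
                lc = rbinom(n, 1, 0.5),
                ants = rbinom(n, 1, 0.7))

# Proportional sampling: about 10% of the rows in each age/lc group
out_frac <- d %>%
  group_by(age, lc) %>%
  sample_frac(0.1)

# Assumption: dplyr >= 1.0.0. slice_sample() takes n = for a fixed
# count per group, or prop = for a fraction per group.
out_slice <- d %>%
  group_by(age, lc) %>%
  slice_sample(n = 30)

table(out_slice$age, out_slice$lc)
```

On newer dplyr versions slice_sample() is the form the package documentation points to, while sample_n() and sample_frac() continue to work.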