Splitting a Dataframe into Confirmatory and Exploratory Samples


Question

I have a very large dataframe (N = 107,251) that I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, Age Category with 6 levels, and Region with 5 levels).

I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age))), but I'm not sure how to utilise this information to actually do the sampling.

Sample dataset:

set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)

Proportions:

round(prop.table(xtabs(~dat$Gender)), 3)  # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3)     # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3)  # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)

The end goal for this dummy example would be two data frames with ~500 observations in each (completely independent, no participant appearing in both), and approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real world applications, I'm not sure if every observation needs to be used or if it is better to get the splits more even.

I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.

Recommended answer

You can check out my stratified function, which you should be able to use like this:

set.seed(1) ## just so you can reproduce this

## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)

## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

summary(sample1)
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422
summary(sample2)
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466 
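
For readers who want to see the idea without the stratified function, here is a minimal base-R sketch of a stratified 50/50 split. It is an illustration under stated assumptions, not the implementation of stratified: the names strata, picked, half_a and half_b are made up for this example, and the floor() rounding rule is an assumption, so the group sizes will not match the output above.

```r
# Base-R sketch of a stratified half split (illustrative only; this is
# NOT how stratified() is implemented, and all names here are hypothetical).
set.seed(42)
n <- 1000
dat <- data.frame(
  Gender = sample(c("M", "F"), n, replace = TRUE),
  Region = sample(as.character(1:5), n, replace = TRUE),
  Age    = sample(as.character(1:6), n, replace = TRUE),
  X1     = rnorm(n),
  stringsAsFactors = TRUE
)

# One stratum per observed Gender x Region x Age combination
# (at most 2 * 5 * 6 = 60 strata).
strata <- interaction(dat$Gender, dat$Region, dat$Age, drop = TRUE)

# Within each stratum, draw half of the row indices at random.
# floor() is an assumed rounding rule: odd-sized strata send the
# extra row to the second half.
set.seed(1)
picked <- unlist(lapply(split(seq_len(nrow(dat)), strata), function(idx) {
  idx[sample.int(length(idx), size = floor(length(idx) / 2))]
}), use.names = FALSE)

half_a <- dat[picked, ]   # first (e.g. exploratory) sample
half_b <- dat[-picked, ]  # remainder (e.g. confirmatory) sample

# The two halves are disjoint and together cover every row.
nrow(half_a) + nrow(half_b) == nrow(dat)
```

With this rule every observation is used, and the two halves differ by at most one row within any stratum. If exactly equal halves matter more than using every row, one row could instead be dropped from each odd-sized stratum.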

Compare the following and see if they are within your expectations.

x1 <- round(prop.table(
  xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
  xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
  xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
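
Eyeballing three 60-cell tables is tedious, so one option is to reduce the comparison to a single number. The sketch below defines a hypothetical helper, max_prop_gap (not part of any package), that returns the largest absolute difference between the joint proportion tables of two data frames. It assumes the grouping columns are factors with identical levels in both data frames, so that the tables have the same shape. The demo uses a plain random split only to keep the example self-contained.

```r
# Hypothetical helper: largest absolute gap between the joint proportion
# tables of data frames `a` and `b` over the columns named in `vars`.
# Assumes those columns are factors sharing the same levels in both inputs.
max_prop_gap <- function(a, b, vars) {
  pa <- prop.table(table(a[vars]))
  pb <- prop.table(table(b[vars]))
  max(abs(pa - pb))
}

# Self-contained demo data (same shape as the question's dummy data).
set.seed(42)
n <- 1000
dat <- data.frame(
  Gender = sample(c("M", "F"), n, replace = TRUE),
  Region = sample(as.character(1:5), n, replace = TRUE),
  Age    = sample(as.character(1:6), n, replace = TRUE),
  stringsAsFactors = TRUE
)
vars <- c("Gender", "Age", "Region")

# A simple (non-stratified) random half split, for illustration only.
idx <- sample(nrow(dat), nrow(dat) / 2)
gap <- max_prop_gap(dat[idx, ], dat[-idx, ], vars)
gap
```

Assuming sample1 and sample2 keep the factor columns of dat, max_prop_gap(sample1, sample2, vars) gives the same check for the stratified halves. A smaller value means the two halves agree more closely on the joint Gender/Age/Region distribution.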

It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.

stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.

set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422  
#
# $SET2
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466
