Sampling a small data frame from a big data frame


Question

I am trying to sample a data frame from a given data frame such that there are enough samples from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought ddply (data frame to data frame) would do it for me. Taking a minimal example:

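library(plyr)
set.seed(1)
# stringsAsFactors = TRUE keeps column a a factor (the default before R 4.0.0), so summary() reports per-level counts
data1 <- data.frame(a = sample(c('B0','B1','B2'), 100, replace = TRUE), b = rnorm(100), c = runif(100), stringsAsFactors = TRUE)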
> summary(data1$a)
B0 B1 B2 
30 32 38

The following command is meant to perform the sampling. When I type

data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))

I get the following error:

   Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace,  : 
  cannot take a sample larger than the population when 'replace = FALSE'

This error occurs because x inside the function passed to ddply is not a vector but a data frame, so sample() draws from its 3 columns rather than from its rows.
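To illustrate the cause, here is a small sketch in base R. The toy data frame df is made up for this example and is not part of the original question; it only shows that sample() treats a data frame as a list of columns.

```r
set.seed(1)
# Hypothetical toy data: 4 rows, 3 columns
df <- data.frame(a = c("B0", "B1", "B2", "B0"), b = rnorm(4), c = runif(4))

length(df)  # 3: the number of columns, which sample() uses as the population size
nrow(df)    # 4: the number of rows

# sample() on a data frame selects columns, not rows
picked <- names(sample(df, 2))
picked

# Asking for 20 items from a population of 3 columns fails without replacement;
# tryCatch captures the error message instead of stopping the script
res <- tryCatch(
  sample(df, 20, replace = FALSE),
  error = function(e) conditionMessage(e)
)
res
```

Sampling row indices with nrow(x) instead, and then subsetting rows, avoids this.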

Does anyone have an idea how to achieve this sampling? I know one way is to skip ddply and do it in three steps: (1) segregation, (2) sampling, and (3) collation. But I suspect there must be some way to do this directly with base or plyr functions.

Thanks for your help.

Recommended answer

I think what you want is to use sample to subset the rows of the data frame passed in as x:

ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])

But, of course, you still need to take care that the sample size for each piece (in this case 20) is no larger than the smallest subset of your data based on the levels of a. Otherwise sample will fail with the same error when replace = FALSE.
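One way to guard against a level with fewer rows than the requested size is to cap the size per group. The sketch below is only an illustration under that assumption: sample_rows is a hypothetical helper, not part of the original answer or of plyr. It also shows the three-step base R route (split, sample, bind) mentioned in the question, and runs the plyr version only if plyr is installed.

```r
# Hypothetical helper: sample up to n rows of x without replacement,
# capping n at the number of rows available in the group
sample_rows <- function(x, n) {
  k <- min(n, nrow(x))
  x[sample.int(nrow(x), k, replace = FALSE), , drop = FALSE]
}

set.seed(1)
data1 <- data.frame(
  a = factor(sample(c("B0", "B1", "B2"), 100, replace = TRUE)),
  b = rnorm(100),
  c = runif(100)
)

# Base R: (1) split by level, (2) sample each piece, (3) bind the pieces together
pieces <- lapply(split(data1, data1$a), sample_rows, n = 20)
data2_base <- do.call(rbind, pieces)

# plyr equivalent; extra arguments to ddply are passed on to the function
if (requireNamespace("plyr", quietly = TRUE)) {
  data2_plyr <- plyr::ddply(data1, "a", sample_rows, n = 20)
}

# Rows drawn per level
table(data2_base$a)
```

With the cap, a level that has fewer than 20 rows contributes all of its rows instead of raising an error. Whether that is acceptable depends on how strictly balanced the sample needs to be.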
