dplyr: Sample size greater than population size
Question
I have a data frame:
> class(dataset)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
> dim(dataset)
[1] 64480 39
from which I want to sample 50,000 rows:
> dataset %>% dplyr::sample_n(50000)
but this always gives me the error:
Error: Sample size (50000) greater than population size (1). Do you want to replace = TRUE?
Yet this, for example, works:
> dim(dataset[1] %>% dplyr::sample_n(50000))
[1] 50000 1
So why is my population size (1)? Does that have something to do with grouping?
Recommended answer
Yes, it most likely has to do with grouping. As the output of class(dataset) shows, your data is currently grouped (note the grouped_df class). On a grouped data frame, sample_n() samples within each group, so it tries to draw 50000 rows from every group, and at least one group has too few observations (here, just 1) to do that without replacement.
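The per-group behaviour can be reproduced on a small made-up data frame. This is only a minimal sketch: `toy` and its columns `g` and `x` are hypothetical stand-ins, not the asker's data, and the exact wording of the error message depends on the installed dplyr version.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 5 rows, group "b" has only 1.
toy <- tibble(
  g = c(rep("a", 5), "b"),
  x = 1:6
)

grouped <- toy %>% group_by(g)

# On a grouped data frame, sample_n() draws `size` rows from EACH group,
# so asking for 3 rows fails because group "b" has only 1 row.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  grouped %>% sample_n(3),
  error = function(e) conditionMessage(e)
)
print(result)

# After ungrouping, the population is the whole table (6 rows),
# so drawing 3 rows without replacement succeeds.
ungrouped_sample <- grouped %>% ungroup() %>% sample_n(3)
nrow(ungrouped_sample)
```

The same reasoning would explain why `dataset[1] %>% sample_n(50000)` works in the question, assuming that column subsetting dropped the grouping there.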
To resolve this, you can either ungroup your data before sampling:
dataset %>% ungroup() %>% sample_n(50000)
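Or you can sample with replacement. Note that because the data is still grouped, this draws 50000 rows from each group rather than 50000 rows in total: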
dataset %>% sample_n(50000, replace = TRUE)
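To make the difference between the two fixes concrete, the sketch below compares their row counts on a small made-up table. The data frame `toy` and its columns are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data with two groups of unequal size (5 rows and 1 row).
toy <- tibble(g = c(rep("a", 5), "b"), x = 1:6) %>% group_by(g)

# Sampling with replacement on grouped data: 3 rows are drawn per group,
# so the result has 3 rows for every group, with repeated rows.
with_replacement <- toy %>% sample_n(3, replace = TRUE)
rows_per_group <- with_replacement %>% count(g)
print(rows_per_group)

# Ungrouping first: 3 distinct rows are drawn from the whole table.
without_replacement <- toy %>% ungroup() %>% sample_n(3)
print(nrow(without_replacement))
```

If the goal is a plain random subset of the whole table, the ungroup() variant is probably the one to use; sampling with replacement changes both the total row count and the uniqueness of the rows.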