R sampling into groups of specific size based on count data
Question
I want to take a df like the one below and cut/bin/group/sample it into groups of size 20. Ideally, this "binning" happens randomly across IDs rather than consecutively from the top row to the bottom row.
E.g. IDs 2, 29 and 71 have counts of 7, 7 and 6 and would fit nicely into a "bin" of size 20. I want to achieve the minimum number of bins and do not care about the order of IDs (the more random they are, the better).
set.seed(123)
df <- data.frame(
  ID = as.numeric(1:100),
  Count = as.numeric(sample(1:8, size = 100, replace = TRUE)))
The desired outcome would be a data frame/tibble looking something like the one below, with optimal random sampling and a minimal number of bins.
Bin_size = 20 is a parameter I set (the ideal outcome is exactly 20; less than 20 is OK, but more than 20 is not). Each bin should be given a number (e.g. if I have 10 bins, I would like them to be numbered Bin_number 1-10).
ID, Count, Bin_size, Bin_number
2, 7, 20, 1
29, 7, 20, 1
71, 6, 20, 1
etc.
Where 7 + 7 + 6 = 20 (etc.)
Any help with this would be much appreciated. I have been thinking about cumsum and group_by but could not figure it out.
If you need more details, I'm happy to provide them. Thanks!
Recommended answer
The BBmisc package has a simple (though not optimized) bin packing algorithm that might be useful:
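library(BBmisc)
library(dplyr)
df %>%
  as_tibble() %>%
  # binPack() assigns each Count to a bin whose total stays <= 20
  mutate(bin = binPack(Count, 20),
         # total Count per bin, repeated on every row of that bin
         bin_size = ave(Count, bin, FUN = sum)) %>%
  arrange(bin)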
# A tibble: 100 x 4
      ID Count   bin bin_size
   <dbl> <dbl> <int>    <dbl>
 1    11     4     1       20
 2    17     8     1       20
 3    27     8     1       20
 4    22     4     2       20
 5    42     8     2       20
 6    56     8     2       20
 7    34     4     3       20
 8    62     8     3       20
 9    79     8     3       20
10    40     4     4       20
# ... with 90 more rows
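If installing BBmisc is not an option, or you want to control how randomly IDs are mixed, a greedy packer is short enough to write in base R. The following is only a sketch, not the BBmisc algorithm: `pack_bins` is a hypothetical helper name, and it uses a first-fit-decreasing heuristic with random tie-breaking among equal counts. Like any greedy approach it is not guaranteed to reach the true minimum number of bins, so the last two lines compare the result against the simple lower bound ceiling(sum(Count) / 20).

```r
# Hypothetical helper (not part of BBmisc): first-fit decreasing with
# random tie-breaking, so rows with equal counts are not taken in row order.
pack_bins <- function(count, capacity = 20) {
  stopifnot(all(count <= capacity))
  # Visit the largest counts first; shuffle within equal counts.
  ord <- order(-count, sample(length(count)))
  bin <- integer(length(count))
  load <- numeric(0)               # running total of each open bin
  for (i in ord) {
    fits <- which(load + count[i] <= capacity)
    if (length(fits) == 0) {
      load <- c(load, count[i])    # nothing fits: open a new bin
      bin[i] <- length(load)
    } else {
      b <- fits[1]                 # first open bin with enough room
      load[b] <- load[b] + count[i]
      bin[i] <- b
    }
  }
  bin
}

set.seed(123)
df <- data.frame(ID = 1:100,
                 Count = sample(1:8, size = 100, replace = TRUE))
df$Bin_number <- pack_bins(df$Count, capacity = 20)
df$Bin_size <- ave(df$Count, df$Bin_number, FUN = sum)

length(unique(df$Bin_number))      # bins actually used
ceiling(sum(df$Count) / 20)        # lower bound on the number of bins
```

If the number of bins used equals the lower bound, the packing is already minimal; if it is larger, rerunning with a different seed or shuffling `ord` completely (at the cost of tighter packing) are options to experiment with.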