Bin columns and aggregate data via random sample with replacement for iteratively larger bin sizes


Question

Here is an example matrix:

mat<- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
   2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
   0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
   0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
   1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow=16, ncol=6)
dimnames(mat)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
          c("1", "2", "3", "4", "5", "6"))

I want to group (bin) columns and then aggregate the data in each group. First, I would like to bin two columns of data. Binned columns must be adjacent to each other (i.e. columns 1 & 2 or columns 5 & 6, NOT columns 4 & 6). Where the binning starts in the matrix is random. For example, in a matrix of 600 columns the first two columns binned may be columns 545 & 546, and the next may be columns 3 & 4. I would like to sample without replacement so that no combination is sampled twice. Aggregation means calculating the row sums of each bin with rowSums(). Each aggregated result becomes a new column in a result matrix. The number of columns in the result matrix is limited to the number of bins randomly sampled.

The bin size then keeps increasing. Next, the bin size increases to 3, so that 3 adjacent columns of data are aggregated. The aggregated data go into a different result matrix. This process continues until the bin is as wide as the whole matrix. All result matrices are collected in a list of matrices.

I have posted a similar question for an alternative binning technique here: Moving window method to aggregate data

I have tried modifying the code so that the binning technique randomly samples n adjacent columns and calculates row sums:

lapply(seq_len(ncol(mat) - 1), function(j) do.call(cbind, 
lapply(sample(ncol(mat)-j, replace = FALSE, size = length(x)), function(i) rowSums(mat[, i:(i + j)]))))

I need help modifying this line of code so that, for a bin size of i, it randomly samples n bins of i adjacent columns without replacement and aggregates each sampled bin using row sums. Note that a combination of columns cannot be sampled twice, but an individual column can be sampled again if it is part of a new combination.

Recommended answer

Here's a parameterized approach that samples from the possible combinations without replacement, calculates the summary from the original data, and labels the result columns so you can see where they came from (and be confident there are no repeats).

set.seed(47)
n_cols_in_bin = 2
n_samps = 4

starting_cols = sample(1:(ncol(mat) -  (n_cols_in_bin - 1)), size = n_samps) 
result = sapply(starting_cols, function(x) rowSums(mat[, x:(x + n_cols_in_bin - 1)]))
colnames(result) = paste0("cols", starting_cols, "to", starting_cols + n_cols_in_bin - 1)
result
#   cols5to6 cols2to3 cols3to4 cols4to5
# a        1        2        0        0
# c        1        0        1        1
# f        0        1        1        0
# h        0        1        1        0
# i        1        2        1        1
# j        0        0        1        1
# l        0        0        0        0
# m        1        0        0        1
# p        1        0        0        0
# q        1        0        0        1
# s        2        0        0        1
# t        2        0        0        0
# u        1        0        0        0
# v        1        0        0        1
# x        0        1        0        0
# z        1        0        0        1

For convenience, we can put it in a function:

foo = function(mat, n_cols_in_bin, n_samps) {
  starting_cols = sample(1:(ncol(mat) -  (n_cols_in_bin - 1)), size = n_samps)
  result = sapply(starting_cols, function(x)
    rowSums(mat[, x:(x + n_cols_in_bin - 1)]))
  colnames(result) = paste0("cols", starting_cols, "to", starting_cols + n_cols_in_bin - 1)
  result
}

foo(mat, n_cols_in_bin = 3, n_samps = 2)
#   cols3to5 cols4to6
# a        0        1
# c        1        2
# f        1        0
# h        1        0
# i        2        1
# j        1        1
# l        0        0
# m        1        1
# p        0        1
# q        1        1
# s        1        2
# t        0        2
# u        0        1
# v        1        1
# x        0        0
# z        1        1
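
The question also asks for one result matrix per bin size, collected in a list. The following is a minimal sketch of one way to do that, not part of the original answer. The names bin_sample and bin_all_sizes and the toy matrix demo are hypothetical. The sketch assumes the matrix has at least two columns, and it caps n_samps at the number of distinct bins, because larger bins leave fewer possible starting columns.

```r
# Hypothetical variant of foo(): caps n_samps at the number of distinct bins
# and uses drop = FALSE so a one-column slice would still be a matrix.
bin_sample <- function(mat, n_cols_in_bin, n_samps) {
  n_bins <- ncol(mat) - n_cols_in_bin + 1      # number of possible starting columns
  starts <- sample.int(n_bins, size = min(n_samps, n_bins))  # without replacement
  result <- sapply(starts, function(x)
    rowSums(mat[, x:(x + n_cols_in_bin - 1), drop = FALSE]))
  # Force a matrix shape even when only one bin (or one row) is present.
  result <- matrix(result, nrow = nrow(mat),
                   dimnames = list(rownames(mat), NULL))
  colnames(result) <- paste0("cols", starts, "to", starts + n_cols_in_bin - 1)
  result
}

# Hypothetical wrapper: one result matrix per bin size, from 2 up to ncol(mat).
bin_all_sizes <- function(mat, n_samps, sizes = 2:ncol(mat)) {
  out <- lapply(sizes, function(k) bin_sample(mat, k, n_samps))
  names(out) <- paste0("bin", sizes)
  out
}

set.seed(1)
# Toy 4 x 6 matrix, used only for illustration.
demo <- matrix(1:24, nrow = 4, dimnames = list(letters[1:4], NULL))
res <- bin_all_sizes(demo, n_samps = 3)
sapply(res, ncol)
# bin2 bin3 bin4 bin5 bin6
#    3    3    3    2    1
```

Because each bin is identified by its starting column, sampling starting columns without replacement means no combination is drawn twice within a bin size, while individual columns can still appear in several overlapping bins.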
