How to use R for same field aggregation by multiple separate groups
Question
I'm trying to count an indicator over several (actually hundreds of) groupings separately (NOT over all combinations of all groups). I'll demonstrate with a simplified example:
Suppose I have this data set:
data <- cbind(c(1,1,1,2,2,2),
              c(1,1,2,2,2,3),
              c(3,2,1,2,2,3))
> data
     [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    2
[3,]    1    2    1
[4,]    2    2    2
[5,]    2    2    2
[6,]    2    3    3
and an indicator:
some_indicator<-c(1,0,0,1,0,1)
Then I want to run, without loops (for example with a column-wise apply), something like:
aggregate(some_indicator,list(data[,1]),sum)
aggregate(some_indicator,list(data[,2]),sum)
aggregate(some_indicator,list(data[,3]),sum)
and combine the output into the following result:
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
That is, for each column (the set of values does not change much between columns), sum the indicator by value and merge the results.
Currently I do this with a loop over the columns, but I need a much more efficient way, since there are a lot of columns and it takes over an hour.
Thanks in advance, Michael.
Recommended answer
1) tapply. The first argument of tapply is data with each column replaced by some_indicator. The second argument indicates that we wish to group by the values in data and by the column number.
result <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
replace(unname(result), is.na(result), 0)
For the input shown in the question, the last line gives:
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
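To see why this works, it can help to look at the two intermediate matrices. The following is a minimal sketch using the data and some_indicator from the question; the names values and columns are introduced here purely for illustration. It also uses the default argument of tapply, which to my knowledge is available from R 3.4.0 onward, as one possible way to fill empty cells with 0 directly instead of replacing NAs afterwards.

```r
data <- cbind(c(1,1,1,2,2,2), c(1,1,2,2,2,3), c(3,2,1,2,2,3))
some_indicator <- c(1,0,0,1,0,1)

# Same shape as data, but every column is a copy of some_indicator
# (the 6 values are recycled column by column over all 18 cells)
values <- replace(data, TRUE, some_indicator)

# For each cell, the number of the column it sits in
columns <- col(data)

# Group by (value in data, column number); empty cells get 0 via `default`
# (assumes R >= 3.4.0; on older versions use the replace() step shown above)
result <- tapply(values, list(data, columns), sum, default = 0)
unname(result)
```

Because both grouping factors are passed in one call, all columns are summed in a single pass rather than one tapply call per column.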
1a) tapply. A somewhat longer tapply solution is the following. fun takes a column as its argument and uses tapply to sum some_indicator using that column as the grouping variable. However, different columns could contain different sets of groups, so to ensure that they all have the same set of groups (for later alignment) we actually group by factor(x, levs). sapply then applies fun to each column of data. The as.data.frame is needed because data is a matrix, so sapply would otherwise iterate over each element rather than each column.
levs <- levels(factor(data))
fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
result <- sapply(as.data.frame(data), fun)
replace(unname(result), is.na(result), 0)
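The role of the shared levels is easier to see on a single column. The sketch below, again using the question's data, compares grouping with and without levs; the variable names without_levels, with_levels, via_sapply and via_apply are illustrative only. It also shows apply over the column margin as a possible alternative to converting the matrix to a data frame, which is my suggestion rather than part of the original answer.

```r
data <- cbind(c(1,1,1,2,2,2), c(1,1,2,2,2,3), c(3,2,1,2,2,3))
some_indicator <- c(1,0,0,1,0,1)

# Column 1 only contains the values 1 and 2, so only two groups come back
without_levels <- tapply(some_indicator, data[, 1], sum)

# With the shared levels every column returns one entry per level,
# with NA for levels that do not occur in that column
levs <- levels(factor(data))
with_levels <- tapply(some_indicator, factor(data[, 1], levs), sum)

fun <- function(x) tapply(some_indicator, factor(x, levs), sum)

# The answer's approach: iterate over the columns of a data frame
via_sapply <- sapply(as.data.frame(data), fun)

# Alternative sketch: apply() over margin 2 walks the matrix columns directly
via_apply <- apply(data, 2, fun)
```

Both variants still need the final replace(..., is.na(...), 0) step if zeros are wanted instead of NAs.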
2) xtabs. This is quite similar to the tapply solution. It has two advantages: (1) sum is implied by xtabs and so need not be specified, and (2) unfilled cells are filled with 0 rather than NA, which eliminates the extra step of replacing NAs with 0. On the other hand, we must unravel each component of the formula into a vector using c since, unlike tapply, the xtabs formula will not accept matrices:
result <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(result) <- NULL
For the data in the question this gives:
> result
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
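The xtabs formula may be easier to read if the three unravelled vectors are first collected in a long-format data frame. This is a sketch of that idea, not part of the original answer; the data frame long and its column names value, group and column are hypothetical names chosen for illustration.

```r
data <- cbind(c(1,1,1,2,2,2), c(1,1,2,2,2,3), c(3,2,1,2,2,3))
some_indicator <- c(1,0,0,1,0,1)

# One row per cell of data: the indicator value, the group, and the column number
long <- data.frame(
  value  = rep(some_indicator, times = ncol(data)),
  group  = c(data),
  column = c(col(data))
)

# Left-hand side of the formula is summed within each (group, column) cell;
# combinations that never occur are reported as 0
result <- xtabs(value ~ group + column, data = long)
result
```

Keeping the dimnames, rather than setting them to NULL, labels each row with its group value and each column with its column number, which may be convenient when there are hundreds of columns.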
REVISED: Revised the tapply solution and added the xtabs solution.